Gemma2-9B CPT Sahabat-AI v1 Base
| Property | Value |
|---|---|
| Parameter Count | 10.2B |
| Model Type | Decoder |
| Languages | English, Indonesian, Javanese, Sundanese |
| Context Length | 8192 tokens |
| License | Gemma Community License |
| Research Paper | SEA HELM Evaluation Paper |
What is gemma2-9b-cpt-sahabatai-v1-base?
Sahabat-AI's Gemma2-9B is a continued pre-trained language model specifically enhanced for Indonesian and regional languages. Built upon the Gemma2 9B architecture, it has been trained on approximately 50B tokens of diverse multilingual data, with a special focus on Indonesian, Javanese, and Sundanese content. The model represents a collaborative effort between GoTo Group and Indosat Ooredoo Hutchison to create a powerful multilingual AI system for Southeast Asian languages.
Implementation Details
The model was trained using MosaicML Composer on 32 Nvidia H100 80GB GPUs over 7 days. It uses bfloat16 precision and a decoupled AdamW optimizer with a weight-stable-decay learning-rate schedule. Training used a learning rate of 1.0e-5 and a global batch size of 256.
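As a rough sanity check (not stated in the model card), the reported context length and batch size imply how many optimizer steps ~50B tokens correspond to, assuming every sequence is packed to the full 8192-token context:

```python
# Back-of-the-envelope step count implied by the reported training setup.
# Assumes full-length packed sequences; actual packing may differ.
CONTEXT_LEN = 8192        # tokens per sequence
GLOBAL_BATCH = 256        # sequences per optimizer step
TOTAL_TOKENS = 50e9       # ~50B training tokens

tokens_per_step = CONTEXT_LEN * GLOBAL_BATCH   # 2,097,152 tokens/step
steps = TOTAL_TOKENS / tokens_per_step         # ~23,842 optimizer steps

print(f"{tokens_per_step:,} tokens/step -> ~{steps:,.0f} steps")
```

Under these assumptions the 50B-token run works out to roughly 24k optimizer steps.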
- Trained on a diverse dataset including Dolma Refined Web, arXiv, Stack V2, and region-specific content
- Uses the Gemma-2-9B tokenizer
- Supports a context length of 8192 tokens
- Achieves state-of-the-art performance on regional language tasks
Core Capabilities
- Achieves 64.123% overall performance on SEA HELM benchmark for Indonesian, Javanese, and Sundanese
- Performs strongly on Javanese (69.882%) and Sundanese (62.446%) tasks
- Retains English capability, averaging 19.62% across standard English benchmarks
- Excels in tasks including Question Answering, Sentiment Analysis, and Translation
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its specialized optimization for Indonesian and regional languages while maintaining strong general language capabilities. It's the result of extensive continued pre-training on a carefully curated dataset that emphasizes local language content.
Q: What are the recommended use cases?
The model is particularly well-suited for applications requiring deep understanding of Indonesian, Javanese, and Sundanese languages, including content generation, translation, and analysis tasks. However, it's important to note that this is a base model that hasn't been aligned for safety, so developers should implement appropriate safety measures before deployment.
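Since this is a base (non-instruct) checkpoint, it is used for plain text completion rather than chat. A minimal loading sketch with Hugging Face transformers, assuming the Hub ID `GoToCompany/gemma2-9b-cpt-sahabatai-v1-base` (verify on the Hub) and a transformers version recent enough to include Gemma2 support:

```python
MODEL_ID = "GoToCompany/gemma2-9b-cpt-sahabatai-v1-base"  # assumed Hub ID


def complete(prompt: str, max_new_tokens: int = 32) -> str:
    """Plain text completion; base models take no chat template."""
    # Imports kept local so the module loads without transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # matches the training precision
        device_map="auto",
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


if __name__ == "__main__":
    # Example Indonesian completion prompt.
    print(complete("Ibu kota Indonesia adalah"))
```

Because the model is unaligned, outputs from this sketch should be filtered or moderated before reaching end users.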