Gemma2-9B CPT Sahabat-AI v1 Base
| Property | Value |
|---|---|
| Parameter Count | 10.2B |
| Model Type | Decoder |
| Languages | English, Indonesian, Javanese, Sundanese |
| Context Length | 8192 tokens |
| License | Gemma Community License |
| Research Paper | SEA HELM Evaluation Paper |
What is gemma2-9b-cpt-sahabatai-v1-base?
Sahabat-AI's Gemma2-9B is a continued pre-trained language model specifically enhanced for Indonesian and regional languages. Built upon the Gemma2 9B architecture, it has been trained on approximately 50B tokens of diverse multilingual data, with a special focus on Indonesian, Javanese, and Sundanese content. The model represents a collaborative effort between GoTo Group and Indosat Ooredoo Hutchison to create a powerful multilingual AI system for Southeast Asian languages.
Implementation Details
The model was trained using MosaicML Composer on 32 Nvidia H100 80GB GPUs over 7 days. It uses bfloat16 precision and a decoupled AdamW optimizer with a weight-stable-decay learning-rate schedule. Training used a learning rate of 1.0e-5 and a global batch size of 256.
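As a rough sanity check (not stated in the model card), the reported context length and batch size imply how many optimizer steps ~50B tokens correspond to, assuming every sequence is packed to the full 8192-token context:

```python
# Back-of-the-envelope step count implied by the reported training setup.
# Assumes full-length packed sequences; actual packing may differ.
CONTEXT_LEN = 8192        # tokens per sequence
GLOBAL_BATCH = 256        # sequences per optimizer step
TOTAL_TOKENS = 50e9       # ~50B training tokens

tokens_per_step = CONTEXT_LEN * GLOBAL_BATCH   # 2,097,152 tokens/step
steps = TOTAL_TOKENS / tokens_per_step         # ~23,842 optimizer steps

print(f"{tokens_per_step:,} tokens/step -> ~{steps:,.0f} steps")
```

Under these assumptions the 50B-token run works out to roughly 24k optimizer steps.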
- Trained on a diverse dataset including Dolma Refined Web, arXiv, Stack V2, and region-specific content
- Uses the Gemma-2-9B tokenizer
- Supports a context length of 8192 tokens
- Achieves state-of-the-art performance on regional language tasks
Core Capabilities
- Achieves 64.123% overall performance on SEA HELM benchmark for Indonesian, Javanese, and Sundanese
- Performs strongly on Javanese (69.882%) and Sundanese (62.446%) tasks
- Retains English capability, averaging 19.62% across standard English benchmarks
- Excels in tasks including Question Answering, Sentiment Analysis, and Translation
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its specialized optimization for Indonesian and regional languages while maintaining strong general language capabilities. It's the result of extensive continued pre-training on a carefully curated dataset that emphasizes local language content.
Q: What are the recommended use cases?
The model is particularly well-suited for applications requiring deep understanding of Indonesian, Javanese, and Sundanese languages, including content generation, translation, and analysis tasks. However, it's important to note that this is a base model that hasn't been aligned for safety, so developers should implement appropriate safety measures before deployment.
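Since this is a base (non-instruct) checkpoint, it is used for plain text completion rather than chat. A minimal loading sketch with Hugging Face transformers, assuming the Hub ID `GoToCompany/gemma2-9b-cpt-sahabatai-v1-base` (verify on the Hub) and a transformers version recent enough to include Gemma2 support:

```python
MODEL_ID = "GoToCompany/gemma2-9b-cpt-sahabatai-v1-base"  # assumed Hub ID


def complete(prompt: str, max_new_tokens: int = 32) -> str:
    """Plain text completion; base models take no chat template."""
    # Imports kept local so the module loads without transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # matches the training precision
        device_map="auto",
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


if __name__ == "__main__":
    # Example Indonesian completion prompt.
    print(complete("Ibu kota Indonesia adalah"))
```

Because the model is unaligned, outputs from this sketch should be filtered or moderated before reaching end users.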