gemma2-9b-cpt-sahabatai-v1-base

Maintained By
GoToCompany

Gemma2-9B CPT Sahabat-AI v1 Base

| Property | Value |
|---|---|
| Parameter Count | 10.2B |
| Model Type | Decoder |
| Languages | English, Indonesian, Javanese, Sundanese |
| Context Length | 8192 tokens |
| License | Gemma Community License |
| Research Paper | SEA HELM Evaluation Paper |

What is gemma2-9b-cpt-sahabatai-v1-base?

Sahabat-AI's Gemma2-9B is a continued pre-trained language model specifically enhanced for Indonesian and regional languages. Built upon the Gemma2 9B architecture, it has been trained on approximately 50B tokens of diverse multilingual data, with a special focus on Indonesian, Javanese, and Sundanese content. The model represents a collaborative effort between GoTo Group and Indosat Ooredoo Hutchison to create a powerful multilingual AI system for Southeast Asian languages.

Implementation Details

The model was trained with MosaicML Composer on 32 NVIDIA H100 80GB GPUs over 7 days. Training used bfloat16 precision and a decoupled AdamW optimizer with a weight-stable-decay learning-rate schedule, with a learning rate of 1.0e-5 and a global batch size of 256.
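As a rough sanity check on these figures (assuming sequences are packed to the full 8192-token context and taking the ~50B-token budget at face value; the card does not state the actual packing scheme), the tokens processed per optimizer step and the implied total step count can be estimated:

```python
# Back-of-the-envelope arithmetic for the stated training configuration.
# Assumes every sequence in the global batch fills the full 8192-token
# context window, which the card does not confirm.

GLOBAL_BATCH_SIZE = 256   # sequences per optimizer step (from the card)
SEQ_LEN = 8192            # context length in tokens (from the card)
TOTAL_TOKENS = 50e9       # approximate continued-pretraining budget

tokens_per_step = GLOBAL_BATCH_SIZE * SEQ_LEN   # 2,097,152 tokens/step
total_steps = TOTAL_TOKENS / tokens_per_step    # roughly 24k optimizer steps

print(f"{tokens_per_step:,} tokens per step, ~{total_steps:,.0f} steps")
```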

  • Trained on a diverse dataset including Dolma, RefinedWeb, StackV2, and region-specific content
  • Implements advanced tokenization using the Gemma-2-9B tokenizer
  • Supports a context length of 8192 tokens
  • Achieves state-of-the-art performance on regional language tasks
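Because the context window is fixed at 8192 tokens, longer documents must be split before pre-training or inference. A minimal sketch of sliding-window chunking over token IDs (the window and stride values here are illustrative defaults, not taken from the card):

```python
def chunk_token_ids(token_ids, window=8192, stride=7936):
    """Split a token-ID sequence into overlapping fixed-size windows.

    A stride smaller than the window keeps some overlap between
    consecutive chunks so no span of text loses all of its context.
    """
    if len(token_ids) <= window:
        return [token_ids]
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
    return chunks

# Example with a small window for readability:
ids = list(range(20))
print(chunk_token_ids(ids, window=8, stride=6))
```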

Core Capabilities

  • Achieves 64.123% overall performance on SEA HELM benchmark for Indonesian, Javanese, and Sundanese
  • Shows exceptional performance in Javanese (69.882%) and Sundanese (62.446%) tasks
  • Retains general English capability, with a 19.62% average score on standard English benchmarks
  • Excels in tasks including Question Answering, Sentiment Analysis, and Translation
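The card reports the overall Indonesian/Javanese/Sundanese figure and two of the three per-language scores. If the overall number is a simple unweighted mean (an assumption; SEA HELM may aggregate differently), the implied Indonesian score can be back-calculated:

```python
# Back-calculate the Indonesian score under an unweighted-mean assumption.
# Only the overall, Javanese, and Sundanese figures come from the card;
# the averaging scheme itself is assumed, not documented here.

overall = 64.123     # SEA HELM overall (ID/JV/SU), from the card
javanese = 69.882    # from the card
sundanese = 62.446   # from the card

implied_indonesian = 3 * overall - javanese - sundanese
print(f"Implied Indonesian score: {implied_indonesian:.3f}")  # ~60.041
```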

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its specialized optimization for Indonesian and regional languages while maintaining strong general language capabilities. It's the result of extensive continued pre-training on a carefully curated dataset that emphasizes local language content.

Q: What are the recommended use cases?

The model is particularly well-suited for applications requiring deep understanding of Indonesian, Javanese, and Sundanese languages, including content generation, translation, and analysis tasks. However, it's important to note that this is a base model that hasn't been aligned for safety, so developers should implement appropriate safety measures before deployment.
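Since the base model ships without safety alignment, deployments typically add their own guardrails around generation. A deliberately minimal sketch of one such measure, a blocklist check applied to model output before it reaches users (the function and blocklist are illustrative placeholders, not part of the model or its tooling):

```python
def passes_blocklist(text, blocked_terms):
    """Return False if any blocked term appears in the generated text.

    A real deployment would layer stronger measures on top of this,
    such as safety classifiers, prompt filtering, and human review.
    """
    lowered = text.lower()
    return not any(term.lower() in lowered for term in blocked_terms)

BLOCKED = ["example-banned-phrase"]  # placeholder list

print(passes_blocklist("A harmless reply.", BLOCKED))               # True
print(passes_blocklist("contains example-banned-phrase", BLOCKED))  # False
```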
