LABahasa 11B
| Property | Value |
|---|---|
| Parameter Count | 11.4B |
| Model Type | Multimodal LLM |
| Base Architecture | Llama-3.2-11B-Vision-Instruct + Whisper-large |
| Training Infrastructure | 8x H100 GPUs |
| Training Time | 25 hours |
What is llama-labahasa-11B?
LABahasa 11B is a multimodal language model developed by Meeting.AI and Lintasarta that processes text, audio, and image inputs simultaneously. Built on Meta's Llama 3.2 and OpenAI's Whisper architectures, it is specifically optimized for Indonesian language processing while maintaining strong English capabilities. The model was trained on a high-quality bilingual dataset of 9 billion tokens.
Implementation Details
The model employs a feed-forward network to project audio embeddings from the Whisper-large encoder into Llama's input embedding space. This architecture lets a single decoder consume multiple input modalities: text, audio, and images. Training was conducted in BF16 mixed precision for throughput and memory efficiency.
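The sketch below illustrates what such a projector could look like. It is a minimal, hypothetical reconstruction, not the released implementation: the layer count, activation, and module name are assumptions, while the dimensions follow the named base models (Whisper-large emits 1280-dimensional encoder states; Llama-3.2-11B-Vision uses 4096-dimensional input embeddings).

```python
import torch
import torch.nn as nn


class AudioProjector(nn.Module):
    """Hypothetical feed-forward projector: Whisper encoder states -> Llama embedding space.

    Dimensions follow the base models; the two-layer shape and GELU are assumptions.
    """

    def __init__(self, whisper_dim: int = 1280, llama_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(whisper_dim, llama_dim),
            nn.GELU(),
            nn.Linear(llama_dim, llama_dim),
        )

    def forward(self, audio_states: torch.Tensor) -> torch.Tensor:
        # audio_states: (batch, n_audio_frames, whisper_dim)
        # returns:      (batch, n_audio_frames, llama_dim)
        return self.proj(audio_states)


# Example: project a dummy 30-second clip (Whisper-large yields 1500 encoder frames).
projector = AudioProjector().to(dtype=torch.bfloat16)  # BF16, matching the training setup
audio_states = torch.randn(1, 1500, 1280, dtype=torch.bfloat16)
audio_embeds = projector(audio_states)  # shape: (1, 1500, 4096)
```

A lightweight projector like this is a common alternative to cross-attention adapters: the decoder is left untouched and audio simply arrives as extra input embeddings.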
- Specialized audio processing via the `<|audio|>` placeholder token (see the splicing sketch after this list)
- Integration with Llama 3.2's vision features
- Enhanced Indonesian language understanding and generation
- Multimodal input processing capabilities
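As a companion to the projector above, here is one plausible way the `<|audio|>` placeholder could be resolved: the placeholder's position in the prompt's embedding sequence is replaced by the projected audio frames. The function name, tensor shapes, and single-placeholder assumption are illustrative, not taken from the released code.

```python
import torch


def splice_audio_embeddings(
    token_embeds: torch.Tensor,   # (seq_len, dim) embeddings of the text prompt
    audio_embeds: torch.Tensor,   # (n_frames, dim) output of the audio projector
    audio_token_pos: int,         # index of the <|audio|> placeholder in the prompt
) -> torch.Tensor:
    """Replace the single <|audio|> placeholder embedding with the projected audio frames."""
    return torch.cat(
        [
            token_embeds[:audio_token_pos],      # text before the placeholder
            audio_embeds,                        # audio frames take the placeholder's slot
            token_embeds[audio_token_pos + 1:],  # text after the placeholder
        ],
        dim=0,
    )


# Example with dummy tensors: a 10-token prompt whose 4th token is <|audio|>.
token_embeds = torch.randn(10, 4096)
audio_embeds = torch.randn(1500, 4096)
inputs_embeds = splice_audio_embeddings(token_embeds, audio_embeds, audio_token_pos=3)
print(inputs_embeds.shape)  # torch.Size([1509, 4096])
```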
Core Capabilities
- Superior performance on MMLU (67.2) compared to Qwen2.5-14B
- Exceptional Indonesian language understanding (72.2 on id-MMLU)
- Multimodal processing of text, audio, and image inputs
- Strong mathematical reasoning capabilities (64.5 on Multi-Mathematics)
Frequently Asked Questions
Q: What makes this model unique?
LABahasa 11B stands out for its specialized optimization for Indonesian language processing while maintaining strong English capabilities, combined with true multimodal abilities across text, audio, and image inputs. Its architecture uniquely combines Llama and Whisper models for comprehensive language understanding.
Q: What are the recommended use cases?
The model is ideal for applications requiring multilingual understanding (particularly Indonesian-English), multimodal processing, and complex NLP tasks. It is especially well-suited to audio transcription, image understanding, and cross-lingual communication.