LABahasa 11B
| Property | Value |
|---|---|
| Parameter Count | 11.4B |
| Model Type | Multimodal LLM |
| Base Architecture | Llama-3.2-11B-Vision-Instruct + Whisper-large |
| Training Infrastructure | 8x H100 GPUs |
| Training Time | 25 hours |
What is llama-labahasa-11B?
LABahasa 11B is a multimodal language model developed by Meeting.AI and Lintasarta that processes text, audio, and image inputs simultaneously. Built on Meta's Llama 3.2 and OpenAI's Whisper architectures, it is specifically optimized for Indonesian language processing while maintaining strong English capabilities. The model was trained on a high-quality bilingual dataset of 9 billion tokens.
Implementation Details
The model employs a feed-forward network to project audio embeddings from the Whisper-large encoder into Llama's input embedding space. This architecture lets a single decoder consume multiple input modalities: text, audio, and images. Training was conducted in BF16 mixed precision for throughput and memory efficiency.
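The sketch below illustrates what such a projector could look like. It is a minimal, hypothetical reconstruction, not the released implementation: the layer count, activation, and module name are assumptions, while the dimensions follow the named base models (Whisper-large emits 1280-dimensional encoder states; Llama-3.2-11B-Vision uses 4096-dimensional input embeddings).

```python
import torch
import torch.nn as nn


class AudioProjector(nn.Module):
    """Hypothetical feed-forward projector: Whisper encoder states -> Llama embedding space.

    Dimensions follow the base models; the two-layer shape and GELU are assumptions.
    """

    def __init__(self, whisper_dim: int = 1280, llama_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(whisper_dim, llama_dim),
            nn.GELU(),
            nn.Linear(llama_dim, llama_dim),
        )

    def forward(self, audio_states: torch.Tensor) -> torch.Tensor:
        # audio_states: (batch, n_audio_frames, whisper_dim)
        # returns:      (batch, n_audio_frames, llama_dim)
        return self.proj(audio_states)


# Example: project a dummy 30-second clip (Whisper-large yields 1500 encoder frames).
projector = AudioProjector().to(dtype=torch.bfloat16)  # BF16, matching the training setup
audio_states = torch.randn(1, 1500, 1280, dtype=torch.bfloat16)
audio_embeds = projector(audio_states)  # shape: (1, 1500, 4096)
```

A lightweight projector like this is a common alternative to cross-attention adapters: the decoder is left untouched and audio simply arrives as extra input embeddings.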
- Specialized audio processing via the `<|audio|>` placeholder token (see the splicing sketch after this list)
- Integration with Llama 3.2's vision features
- Enhanced Indonesian language understanding and generation
- Multimodal input processing capabilities
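As a companion to the projector above, here is one plausible way the `<|audio|>` placeholder could be resolved: the placeholder's position in the prompt's embedding sequence is replaced by the projected audio frames. The function name, tensor shapes, and single-placeholder assumption are illustrative, not taken from the released code.

```python
import torch


def splice_audio_embeddings(
    token_embeds: torch.Tensor,   # (seq_len, dim) embeddings of the text prompt
    audio_embeds: torch.Tensor,   # (n_frames, dim) output of the audio projector
    audio_token_pos: int,         # index of the <|audio|> placeholder in the prompt
) -> torch.Tensor:
    """Replace the single <|audio|> placeholder embedding with the projected audio frames."""
    return torch.cat(
        [
            token_embeds[:audio_token_pos],      # text before the placeholder
            audio_embeds,                        # audio frames take the placeholder's slot
            token_embeds[audio_token_pos + 1:],  # text after the placeholder
        ],
        dim=0,
    )


# Example with dummy tensors: a 10-token prompt whose 4th token is <|audio|>.
token_embeds = torch.randn(10, 4096)
audio_embeds = torch.randn(1500, 4096)
inputs_embeds = splice_audio_embeddings(token_embeds, audio_embeds, audio_token_pos=3)
print(inputs_embeds.shape)  # torch.Size([1509, 4096])
```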
Core Capabilities
- Superior performance on MMLU (67.2) compared to Qwen2.5-14B
- Exceptional Indonesian language understanding (72.2 on id-MMLU)
- Multimodal processing of text, audio, and image inputs
- Strong mathematical reasoning capabilities (64.5 on Multi-Mathematics)
Frequently Asked Questions
Q: What makes this model unique?
LABahasa 11B stands out for its specialized optimization for Indonesian language processing while maintaining strong English capabilities, combined with true multimodal abilities across text, audio, and image inputs. Its architecture uniquely combines Llama and Whisper models for comprehensive language understanding.
Q: What are the recommended use cases?
The model is ideal for applications requiring multilingual understanding (particularly Indonesian-English), multimodal processing, and complex NLP tasks. It is especially well-suited to audio transcription, image understanding, and cross-lingual communication.