SeaLLM-13B-Chat

SeaLLMs

Southeast Asian-focused LLM supporting 10 languages, fine-tuned from Llama-2 with enhanced cultural adaptation and superior performance in non-Latin scripts

Property	Value
Base Model	Llama-2
Languages Supported	10 (Vietnamese, Indonesian, Thai, Chinese, Khmer, Lao, Burmese, Malay, Tagalog, English)
License	SeaLLMs License
Paper	Technical Report

What is SeaLLM-13B-Chat?

SeaLLM-13B-Chat is a large language model designed for Southeast Asian languages. Built upon Llama-2, it has been further pre-trained and fine-tuned across 10 languages, with particular emphasis on non-Latin script languages such as Thai, Khmer, Lao, and Burmese. The model is notable for its cultural adaptation and for outperforming ChatGPT-3.5 in many Southeast Asian languages.

Implementation Details

The model uses an expanded vocabulary that substantially lowers the token compression ratio for Southeast Asian text (for Thai, from 4.29x to 1.57x), so the same text is encoded in far fewer tokens. Training proceeded in multiple stages: pre-training, supervised fine-tuning, and self-preferencing DPO (Direct Preference Optimization).
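
For readers unfamiliar with DPO, the loss for a single preference pair can be sketched as below. This is the generic DPO objective, not the SeaLLM training code: the details of the self-preferencing variant are not given here, and the function name, its arguments, and the beta value are illustrative assumptions.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO loss for one (chosen, rejected) response pair.

    Inputs are summed log-probabilities of each response under the
    policy being trained and under a frozen reference model.
    beta=0.1 is an illustrative default, not a value from the source.
    """
    # How much more (or less) likely each response became relative
    # to the reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: the loss equals log(2).
neutral = dpo_pair_loss(-12.0, -15.0, -12.0, -15.0)
# Policy raises the chosen response and lowers the rejected one.
improved = dpo_pair_loss(-10.0, -17.0, -12.0, -15.0)
print(neutral, improved)
```

The loss falls as the policy shifts probability toward the preferred response relative to the reference model, which is the behavior preference optimization rewards.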

Expanded vocabulary with ~16K new tokens for SEA languages
Multi-stage training process with dynamic data mixture control
Culturally-adapted safety measures and local compliance
Enhanced tokenization efficiency for non-Latin scripts
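
To make the tokenization figures above concrete, here is a minimal sketch of how a compression ratio and its effect on usable context could be computed. It assumes the ratio is the token count of a text divided by the token count of an equivalent reference text (for example its English counterpart), which this page does not define. The helper names are hypothetical, and the byte-level tokenizer is a toy stand-in, not the SeaLLM or Llama-2 tokenizer.

```python
def compression_ratio(text, reference_text, tokenize):
    """Tokens needed for `text` relative to tokens for `reference_text`."""
    reference_count = len(tokenize(reference_text))
    if reference_count == 0:
        raise ValueError("reference_text produced no tokens")
    return len(tokenize(text)) / reference_count


def context_gain(ratio_before, ratio_after):
    """How much more text fits in a fixed context window after the change."""
    return ratio_before / ratio_after


def byte_tokenize(text):
    """Toy tokenizer (an assumption for illustration): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


# Thai greeting vs. its English counterpart under the toy tokenizer.
toy_ratio = compression_ratio("สวัสดี", "hello", byte_tokenize)

# Using the Thai figures quoted above: 4.29x before, 1.57x after.
thai_gain = context_gain(4.29, 1.57)
print(f"toy ratio: {toy_ratio:.2f}, Thai context gain: {thai_gain:.2f}x")
```

Under this definition, the quoted Thai figures imply that roughly 2.7 times as much Thai text fits in the same context length after vocabulary expansion.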

Core Capabilities

Outperforms ChatGPT-3.5 in non-Latin Southeast Asian languages
Superior performance on the M3Exam benchmark across multiple languages
Enhanced cultural understanding and local context awareness
Improved machine translation capabilities for low-resource languages
Strong safety measures aligned with local cultural norms and regulations

Frequently Asked Questions

Q: What makes this model unique?

SeaLLM-13B-Chat's distinguishing strength is its optimization for Southeast Asian languages, particularly those written in non-Latin scripts, while maintaining strong performance in English. It also shows better cultural adaptation and local compliance than Western-built LLMs.

Q: What are the recommended use cases?

The model is well suited to applications that require a deep understanding of Southeast Asian languages and cultures, including translation, content generation, and educational assistance. It is particularly effective for tasks involving Thai, Khmer, Lao, and Burmese, where it shows its largest advantages over models such as ChatGPT-3.5.