OPEN-SOLAR-KO-10.7B
| Property | Value |
|---|---|
| Parameter Count | 10.7B |
| Model Type | Text Generation |
| Architecture | Optimized Transformer (Llama-2 based) |
| License | Apache 2.0 |
| Languages | Korean, English |
| Vocabulary Size | 46,592 tokens |
What is OPEN-SOLAR-KO-10.7B?
OPEN-SOLAR-KO-10.7B represents an advanced iteration of the upstage/SOLAR-10.7B-v1.0 model, specifically enhanced for Korean language processing while maintaining English capabilities. Developed by Junbum Lee (Beomi), this model features an expanded vocabulary and comprehensive Korean corpus integration, trained on over 15 billion tokens using publicly available datasets.
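As a quick orientation, the snippet below sketches how the model might be loaded with the Hugging Face transformers library. The repository id beomi/OPEN-SOLAR-KO-10.7B, the half-precision dtype, and the device placement are assumptions for illustration, not settings taken from this card.

```python
# Minimal loading sketch with Hugging Face transformers.
# Assumption: the model is published as "beomi/OPEN-SOLAR-KO-10.7B";
# adjust the repo id, dtype, and device map to your environment.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/OPEN-SOLAR-KO-10.7B"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit a 10.7B model on a single GPU
    device_map="auto",
)

print(f"Vocabulary size: {len(tokenizer)}")  # expected: 46,592 tokens
```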
Implementation Details
The model uses an optimized transformer architecture derived from Llama-2, incorporating Grouped-Query Attention (GQA) and supporting context lengths of up to 4k tokens. Training used a learning rate of 5e-5 and was conducted exclusively on public Korean corpora from AI Hub, Modu Corpus, and Korean Wikipedia.
- Expanded vocabulary from 32,000 to 46,592 tokens
- Significantly improved Korean tokenization efficiency (see the tokenizer sketch after this list)
- Training corpus size: approximately 61GB
- Supports both Korean and English text generation
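The effect of the expanded vocabulary can be checked with a rough tokenizer comparison like the one below. The repository ids (beomi/OPEN-SOLAR-KO-10.7B for this model, upstage/SOLAR-10.7B-v1.0 for the base) and the sample sentence are illustrative assumptions.

```python
# Rough tokenization-efficiency comparison (repo ids are assumptions).
from transformers import AutoTokenizer

korean_tok = AutoTokenizer.from_pretrained("beomi/OPEN-SOLAR-KO-10.7B")  # assumed repo id
base_tok = AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-v1.0")

sample = "한국어 자연어 처리는 토크나이저 효율이 매우 중요합니다."  # arbitrary Korean sentence

korean_ids = korean_tok(sample)["input_ids"]
base_ids = base_tok(sample)["input_ids"]

print(f"OPEN-SOLAR-KO tokens: {len(korean_ids)}")
print(f"Original SOLAR tokens: {len(base_ids)}")
# The expanded 46,592-token vocabulary should produce noticeably fewer tokens
# for Korean text than the original 32,000-token vocabulary.
```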
Core Capabilities
- Efficient Korean text processing with optimized tokenization
- Strong performance on Korean language benchmarks
- Bilingual capabilities in Korean and English
- High accuracy in tasks like NSMC (89.6%) and KoBEST BoolQ (90.2%)
Frequently Asked Questions
Q: What makes this model unique?
The model's key distinction lies in its expanded Korean vocabulary and exclusive use of publicly accessible Korean corpora, making it freely available under the Apache 2.0 license. It demonstrates significantly improved tokenization efficiency for Korean text, reducing token counts by up to 70% compared to the original SOLAR model.
Q: What are the recommended use cases?
The model is particularly well-suited for Korean language processing tasks, including text generation, sentiment analysis, and question-answering. It shows strong performance in various Korean language benchmarks, making it ideal for both academic and commercial applications requiring Korean language capabilities.
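As an illustration of the text-generation use case, a minimal prompt-completion sketch might look like the following. The repo id, prompt, and sampling parameters are assumptions rather than recommendations from the model card.

```python
# Minimal Korean text-generation sketch (repo id and sampling settings are assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/OPEN-SOLAR-KO-10.7B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "대한민국의 수도는"  # "The capital of South Korea is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```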