OPEN-SOLAR-KO-10.7B
| Property | Value |
|---|---|
| Parameter Count | 10.7B |
| Model Type | Text Generation |
| Architecture | Optimized Transformer (Llama-2 based) |
| License | Apache 2.0 |
| Languages | Korean, English |
| Vocabulary Size | 46,592 tokens |
What is OPEN-SOLAR-KO-10.7B?
OPEN-SOLAR-KO-10.7B represents an advanced iteration of the upstage/SOLAR-10.7B-v1.0 model, specifically enhanced for Korean language processing while maintaining English capabilities. Developed by Junbum Lee (Beomi), this model features an expanded vocabulary and comprehensive Korean corpus integration, trained on over 15 billion tokens using publicly available datasets.
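As a quick orientation, the snippet below sketches how the model might be loaded with the Hugging Face transformers library. The repository id beomi/OPEN-SOLAR-KO-10.7B, the half-precision dtype, and the device placement are assumptions for illustration, not settings taken from this card.

```python
# Minimal loading sketch with Hugging Face transformers.
# Assumption: the model is published as "beomi/OPEN-SOLAR-KO-10.7B";
# adjust the repo id, dtype, and device map to your environment.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/OPEN-SOLAR-KO-10.7B"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit a 10.7B model on a single GPU
    device_map="auto",
)

print(f"Vocabulary size: {len(tokenizer)}")  # expected: 46,592 tokens
```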
Implementation Details
The model uses an optimized transformer architecture derived from Llama-2, incorporating Grouped-Query Attention (GQA) and supporting context lengths of up to 4k tokens. Training used a learning rate of 5e-5 and was conducted exclusively on public Korean corpora from AI Hub, Modu Corpus, and Korean Wikipedia.
- Expanded vocabulary from 32,000 to 46,592 tokens
- Significantly improved Korean tokenization efficiency (see the tokenizer sketch after this list)
- Training corpus size: approximately 61GB
- Supports both Korean and English text generation
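The effect of the expanded vocabulary can be checked with a rough tokenizer comparison like the one below. The repository ids (beomi/OPEN-SOLAR-KO-10.7B for this model, upstage/SOLAR-10.7B-v1.0 for the base) and the sample sentence are illustrative assumptions.

```python
# Rough tokenization-efficiency comparison (repo ids are assumptions).
from transformers import AutoTokenizer

korean_tok = AutoTokenizer.from_pretrained("beomi/OPEN-SOLAR-KO-10.7B")  # assumed repo id
base_tok = AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-v1.0")

sample = "한국어 자연어 처리는 토크나이저 효율이 매우 중요합니다."  # arbitrary Korean sentence

korean_ids = korean_tok(sample)["input_ids"]
base_ids = base_tok(sample)["input_ids"]

print(f"OPEN-SOLAR-KO tokens: {len(korean_ids)}")
print(f"Original SOLAR tokens: {len(base_ids)}")
# The expanded 46,592-token vocabulary should produce noticeably fewer tokens
# for Korean text than the original 32,000-token vocabulary.
```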
Core Capabilities
- Efficient Korean text processing with optimized tokenization
- Strong performance on Korean language benchmarks
- Bilingual capabilities in Korean and English
- High accuracy in tasks like NSMC (89.6%) and KoBEST BoolQ (90.2%)
Frequently Asked Questions
Q: What makes this model unique?
The model's key distinction lies in its expanded Korean vocabulary and exclusive use of publicly accessible Korean corpora, making it freely available under the Apache 2.0 license. It demonstrates significantly improved tokenization efficiency for Korean text, reducing token counts by up to 70% compared to the original SOLAR model.
Q: What are the recommended use cases?
The model is particularly well-suited for Korean language processing tasks, including text generation, sentiment analysis, and question-answering. It shows strong performance in various Korean language benchmarks, making it ideal for both academic and commercial applications requiring Korean language capabilities.
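As an illustration of the text-generation use case, a minimal prompt-completion sketch might look like the following. The repo id, prompt, and sampling parameters are assumptions rather than recommendations from the model card.

```python
# Minimal Korean text-generation sketch (repo id and sampling settings are assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/OPEN-SOLAR-KO-10.7B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "대한민국의 수도는"  # "The capital of South Korea is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```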