# llama-3.2-Korean-Bllossom-AICA-5B
| Property | Value |
|---|---|
| Base Model | LLaMA 3.2 (3B) |
| Parameters | 5B |
| Type | Vision-Language Model + Language Model |
| Developer | Bllossom Team (MLPLab at Seoultech, Teddysum, Yonsei Univ.) |
| Paper | COLING 2025 (upcoming) |
## What is llama-3.2-Korean-Bllossom-AICA-5B?
llama-3.2-Korean-Bllossom-AICA-5B is a Korean-English bilingual model that combines vision-language and text-only capabilities in a single architecture. Built on LLaMA 3.2 (3B) and expanded to 5B parameters, it is the first 3B-based expansion model that can switch seamlessly between visual and text-only tasks while maintaining high performance in both domains.
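As a quick orientation, here is a minimal text-only usage sketch. The `AutoProcessor` and `AutoModelForVision2Seq` classes are assumptions based on the model following a standard LLaVA-style layout on the Hugging Face Hub; consult the model card for the officially supported loading code.

```python
# Minimal text-only sketch. Class names are assumptions (LLaVA-style layout);
# check the model card for the official loading code.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "Bllossom/llama-3.2-Korean-Bllossom-AICA-5B"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Text-only input: no image is passed, so the model answers as a plain LLM.
messages = [{"role": "user", "content": "서울의 유명한 관광 코스를 만들어줄래?"}]  # "Can you plan a famous Seoul tour itinerary?"
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```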
## Implementation Details
The model was trained on virtually all publicly available Korean LLM pre-training data from Hugging Face, combined with vision-language datasets from AI-Hub and KISTI AI and custom instruction-tuning data. As a result, it handles both unimodal and multimodal tasks with a single set of weights.
- Dual-mode functionality with automatic switching based on input type (see the sketch after this list)
- Improved language-model performance through visual understanding (a reported 20% gain over the base model)
- Specialized optimization for Korean OCR, table, and graph interpretation
- Selective knowledge reasoning capability for RAG applications
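To illustrate the dual-mode switching, the sketch below continues from the loading code above and passes an image alongside the text prompt. The `<image>` placeholder token is an assumption borrowed from LLaVA-style processors and may differ for this checkpoint.

```python
# Hedged sketch of the vision-language path, reusing `processor` and `model`
# from the earlier example. Supplying an image switches the model into
# multimodal mode; text-only inputs take the plain-LLM path instead.
from PIL import Image

image = Image.open("document.png")  # any local document or photo

# "<image>" is an assumed placeholder token (LLaVA convention).
messages = [{"role": "user", "content": "<image>\n이 문서의 내용을 요약해줘."}]  # "Summarize this document."
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```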
## Core Capabilities
- Bilingual processing (Korean-English) without performance compromise
- Vision-language tasks including image understanding and description
- Advanced reasoning, with a LogicKor overall score of 7.38
- Runs on a free Colab GPU, which is rare for vision-language models (see the quantized-loading sketch below)
- Commercial usage permitted
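One way to fit the model on a free Colab GPU is 4-bit quantization. This is a hedged sketch, not an official recipe: it assumes bitsandbytes is installed (`pip install bitsandbytes`), and the quantization settings are illustrative.

```python
# Illustrative 4-bit loading for a free Colab GPU (e.g. T4).
# Settings are assumptions, not an official recommendation.
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # T4 GPUs do not support bfloat16
)

model = AutoModelForVision2Seq.from_pretrained(
    "Bllossom/llama-3.2-Korean-Bllossom-AICA-5B",
    quantization_config=bnb_config,
    device_map="auto",
)
```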
## Frequently Asked Questions
Q: What makes this model unique?
It's the first LLaMA-based model that successfully combines vision-language and pure language capabilities while maintaining high performance in both modes. It can automatically switch between these modes based on input type, making it highly versatile for various applications.
Q: What are the recommended use cases?
The model excels in Korean OCR applications, document analysis, table/graph interpretation, and general language tasks. It's particularly useful for applications requiring both visual and textual understanding, such as document processing systems, chatbots with image capabilities, and educational tools.
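For the document-understanding use cases above, prompting is plain instruction-following. The Korean prompts below are hypothetical examples (English glosses in comments) that would be paired with an image via the processor call shown earlier; any clear instruction in Korean or English should work.

```python
# Hypothetical prompts for OCR and table interpretation; pass them together
# with an image through the processor call from the earlier sketches.
ocr_messages = [{
    "role": "user",
    "content": "<image>\n이 이미지의 모든 텍스트를 그대로 옮겨 적어줘.",  # "Transcribe all text in this image verbatim."
}]
table_messages = [{
    "role": "user",
    "content": "<image>\n이 표를 마크다운 표로 변환해줘.",  # "Convert this table to a markdown table."
}]
```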