# llama-3.2-Korean-Bllossom-AICA-5B
| Property | Value |
|---|---|
| Base Model | LLaMA 3.2 (3B) |
| Parameters | 5B |
| Type | Vision-Language Model + Language Model |
| Developer | Bllossom Team (MLPLab at Seoultech, Teddysum, Yonsei Univ.) |
| Paper | COLING 2025 (upcoming) |
## What is llama-3.2-Korean-Bllossom-AICA-5B?
llama-3.2-Korean-Bllossom-AICA-5B is a Korean-English bilingual model that combines vision-language and text-only capabilities in a single architecture. Built on LLaMA 3.2 (3B) and expanded to 5B parameters, it is the first 3B-based expansion model that can switch seamlessly between visual and text-only tasks while maintaining high performance in both domains.
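As a quick orientation, here is a minimal text-only usage sketch. The `AutoProcessor` and `AutoModelForVision2Seq` classes are assumptions based on the model following a standard LLaVA-style layout on the Hugging Face Hub; consult the model card for the officially supported loading code.

```python
# Minimal text-only sketch. Class names are assumptions (LLaVA-style layout);
# check the model card for the official loading code.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "Bllossom/llama-3.2-Korean-Bllossom-AICA-5B"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Text-only input: no image is passed, so the model answers as a plain LLM.
messages = [{"role": "user", "content": "서울의 유명한 관광 코스를 만들어줄래?"}]  # "Can you plan a famous Seoul tour itinerary?"
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```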
## Implementation Details
The model was trained on virtually all publicly available Korean LLM pre-training data from Hugging Face, combined with vision-language datasets from AI-Hub and KISTI AI and custom instruction-tuning data. As a result, it handles both unimodal and multimodal tasks with a single set of weights.
- Dual-mode functionality with automatic switching based on input type (see the sketch after this list)
- Improved language-model performance through visual understanding (a reported 20% gain over the base model)
- Specialized optimization for Korean OCR, table, and graph interpretation
- Selective knowledge reasoning capability for RAG applications
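To illustrate the dual-mode switching, the sketch below continues from the loading code above and passes an image alongside the text prompt. The `<image>` placeholder token is an assumption borrowed from LLaVA-style processors and may differ for this checkpoint.

```python
# Hedged sketch of the vision-language path, reusing `processor` and `model`
# from the earlier example. Supplying an image switches the model into
# multimodal mode; text-only inputs take the plain-LLM path instead.
from PIL import Image

image = Image.open("document.png")  # any local document or photo

# "<image>" is an assumed placeholder token (LLaVA convention).
messages = [{"role": "user", "content": "<image>\n이 문서의 내용을 요약해줘."}]  # "Summarize this document."
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```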
## Core Capabilities
- Bilingual processing (Korean-English) without performance compromise
- Vision-language tasks including image understanding and description
- Advanced reasoning, with a LogicKor overall score of 7.38
- Runs on a free Colab GPU, which is rare for vision-language models (see the quantized-loading sketch below)
- Commercial usage permitted
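One way to fit the model on a free Colab GPU is 4-bit quantization. This is a hedged sketch, not an official recipe: it assumes bitsandbytes is installed (`pip install bitsandbytes`), and the quantization settings are illustrative.

```python
# Illustrative 4-bit loading for a free Colab GPU (e.g. T4).
# Settings are assumptions, not an official recommendation.
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # T4 GPUs do not support bfloat16
)

model = AutoModelForVision2Seq.from_pretrained(
    "Bllossom/llama-3.2-Korean-Bllossom-AICA-5B",
    quantization_config=bnb_config,
    device_map="auto",
)
```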
## Frequently Asked Questions
Q: What makes this model unique?
It's the first LLaMA-based model that successfully combines vision-language and pure language capabilities while maintaining high performance in both modes. It can automatically switch between these modes based on input type, making it highly versatile for various applications.
Q: What are the recommended use cases?
The model excels in Korean OCR applications, document analysis, table/graph interpretation, and general language tasks. It's particularly useful for applications requiring both visual and textual understanding, such as document processing systems, chatbots with image capabilities, and educational tools.
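For the document-understanding use cases above, prompting is plain instruction-following. The Korean prompts below are hypothetical examples (English glosses in comments) that would be paired with an image via the processor call shown earlier; any clear instruction in Korean or English should work.

```python
# Hypothetical prompts for OCR and table interpretation; pass them together
# with an image through the processor call from the earlier sketches.
ocr_messages = [{
    "role": "user",
    "content": "<image>\n이 이미지의 모든 텍스트를 그대로 옮겨 적어줘.",  # "Transcribe all text in this image verbatim."
}]
table_messages = [{
    "role": "user",
    "content": "<image>\n이 표를 마크다운 표로 변환해줘.",  # "Convert this table to a markdown table."
}]
```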