kanana-nano-2.1b-embedding

kakaocorp

A 2.1B-parameter bilingual embedding model optimized for Korean and English text similarity, with reported benchmark scores of 65% on Korean and 51.56% on English.

Property	Value
Parameter Count	2.1B
License	CC-BY-NC-4.0
Author	Kakaocorp
Paper	arXiv:2502.18934

What is kanana-nano-2.1b-embedding?

kanana-nano-2.1b-embedding is a bilingual embedding model built for text similarity and retrieval in both Korean and English. It is part of the larger Kanana model series developed by Kakao, which takes a compute-efficient approach to bilingual language modeling, and its strongest reported results are on Korean-language tasks.

Implementation Details

The model builds on pre-training techniques that include high-quality data filtering, staged pre-training, and depth up-scaling. It is tuned specifically for embedding generation and is reported to reach 65% on Korean benchmarks and 51.56% on English benchmarks, ahead of several comparable models in its size range.

Compute-efficient architecture optimized for bilingual processing
Specialized for text similarity and retrieval tasks
Implements advanced embedding generation techniques
Supports batch processing through DataLoader functionality

Core Capabilities

Generates high-quality text embeddings for both Korean and English
Supports variable-length inputs of up to 512 tokens
Provides efficient batch processing capabilities
Optimized for retrieval-based applications
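
The input limit and batch processing listed above can be illustrated with a small preprocessing sketch. It is not the model's own pipeline: the helper names truncate and batches are hypothetical, the inputs are made-up token id lists standing in for tokenizer output, and the only value taken from this page is the 512-token limit.

```python
from typing import Iterator, List, Sequence

MAX_TOKENS = 512  # input limit stated in the capability list


def truncate(token_ids: Sequence[int], max_tokens: int = MAX_TOKENS) -> List[int]:
    """Keep at most max_tokens leading token ids (hypothetical helper)."""
    return list(token_ids[:max_tokens])


def batches(items: Sequence[Sequence[int]], batch_size: int) -> Iterator[List[List[int]]]:
    """Yield consecutive batches of truncated inputs (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield [truncate(ids) for ids in items[start:start + batch_size]]


# Made-up pre-tokenized inputs of different lengths; a real tokenizer would produce these.
inputs = [list(range(n)) for n in (5, 700, 512, 1)]
batched = list(batches(inputs, batch_size=3))

for index, batch in enumerate(batched):
    print(index, [len(ids) for ids in batch])
```

Each batch would then be passed to the embedding model in one forward pass. In practice a framework utility such as a DataLoader usually handles the batching, and the tokenizer handles truncation.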

Frequently Asked Questions

Q: What makes this model unique?

The model pairs strong Korean-language performance with competitive English performance in a compute-efficient 2.1B-parameter architecture. It is designed specifically for embedding generation and retrieval, which makes it a good fit for bilingual applications.

Q: What are the recommended use cases?

The model is best suited for text similarity search, document retrieval, and question-answering systems that require strong bilingual capabilities, particularly in Korean-English contexts. It's optimized for generating embeddings that can be used for semantic search and retrieval tasks.
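
As a rough illustration of the semantic search use case, the sketch below ranks documents by cosine similarity to a query embedding. It assumes the embeddings have already been computed; the 3-dimensional vectors are toy values rather than model output, and cosine_similarity and rank_documents are hypothetical helper names, not part of any Kanana API.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real query and document embeddings.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_documents(query, docs, top_k=2)
print(ranking)
```

With real embeddings, the query and documents can be in either Korean or English, since both are mapped into the same vector space. For large collections, an approximate nearest-neighbor index would normally replace this exhaustive loop.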