KaLM-embedding-multilingual-mini-instruct-v1
| Property | Value |
|---|---|
| Model Size | 494M parameters |
| Base Architecture | Qwen/Qwen2-0.5B |
| MTEB Score | 64.74% |
| C-MTEB Score | 63.57% |
| Repository | HITsz-TMG/KaLM-Embedding |
What is KaLM-embedding-multilingual-mini-instruct-v1?
KaLM-embedding-multilingual-mini-instruct-v1 is a multilingual embedding model developed by HIT-TMG. Built on the Qwen2-0.5B architecture, it was trained with weakly supervised pre-training followed by supervised fine-tuning. The model specializes in generating high-quality embeddings across multiple languages, with particular strength in instruction-tuned tasks.
Implementation Details
The model combines the strengths of an auto-regressive LLM backbone with instruction tuning. It requires transformers>=4.37.0 and integrates directly with the sentence-transformers library. It supports a maximum sequence length of 512 tokens and includes special handling for asymmetric tasks such as retrieval, reranking, classification, and clustering.
- Superior performance on both MTEB (64.74%) and C-MTEB (63.57%) benchmarks
- Efficient 494M parameter architecture
- Support for instruction-based embedding generation
- Normalized embedding output capability
Core Capabilities
- Multilingual text embedding generation
- Instruction-tuned task handling
- Efficient batch processing with customizable batch sizes
- Support for asymmetric tasks (retrieval, reranking, classification, clustering)
- Flexible prompt engineering with instruction prefixing
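For the asymmetric tasks listed above, instruction prefixing typically applies only to the query side. The `build_instructed_input` helper and the "Instruct: ... Query: ..." template below are illustrative: this format is a common convention among instruction-tuned embedding models, but the exact template this model expects should be confirmed from its model card:

```python
def build_instructed_input(instruction: str, text: str) -> str:
    """Prefix a query with a task instruction.

    The template here is a common convention for instruction-tuned
    embedding models; it is an assumption, not the confirmed format
    for this specific model.
    """
    return f"Instruct: {instruction} \n Query: {text}"

# Asymmetric retrieval: instruct the query side only; documents are
# embedded as plain text with no prefix.
query = build_instructed_input(
    "Given a question, retrieve passages that answer it.",
    "What is the capital of France?",
)
document = "Paris is the capital and largest city of France."
print(query)
```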
Frequently Asked Questions
Q: What makes this model unique?
The model stands out for its superior performance on multilingual benchmarks, outperforming larger models like multilingual-e5-large and bge-m3, while maintaining a relatively compact size of 494M parameters. Its instruction-tuning capability makes it particularly versatile for various embedding tasks.
Q: What are the recommended use cases?
The model is ideal for multilingual applications requiring high-quality text embeddings, including document retrieval, semantic search, text classification, and clustering tasks. It's particularly effective when instruction-based customization is needed for specific use cases.
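Semantic search with this kind of model reduces to ranking documents by cosine similarity, which for normalized embeddings is just a dot product. A minimal sketch, using toy vectors in place of real model outputs:

```python
import numpy as np

def rank_by_similarity(query_vec: np.ndarray, doc_vecs: np.ndarray) -> np.ndarray:
    """Return document indices sorted by cosine similarity, descending.

    Assumes all vectors are unit-length, so cosine similarity is a
    plain dot product.
    """
    scores = doc_vecs @ query_vec
    return np.argsort(-scores)

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

# Toy 4-dimensional vectors standing in for real embedding outputs.
query = normalize(np.array([1.0, 0.2, 0.0, 0.0]))
docs = np.stack([
    normalize(np.array([0.9, 0.1, 0.0, 0.1])),  # semantically close
    normalize(np.array([0.0, 0.0, 1.0, 0.0])),  # unrelated
])
order = rank_by_similarity(query, docs)
print(order.tolist())  # most similar document index first
```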