KaLM-embedding-multilingual-mini-instruct-v1

Maintained By
HIT-TMG

Property            Value
------------------  --------------------------
Model Size          494M parameters
Base Architecture   Qwen/Qwen2-0.5B
MTEB Score          64.74%
C-MTEB Score        63.57%
Repository          HITsz-TMG/KaLM-Embedding

What is KaLM-embedding-multilingual-mini-instruct-v1?

KaLM-embedding-multilingual-mini-instruct-v1 is a multilingual embedding model developed by HIT-TMG. Built on the Qwen2-0.5B architecture, it is trained in two stages: weakly-supervised pre-training followed by supervised fine-tuning. The model generates high-quality embeddings across many languages and is particularly strong on instruction-tuned tasks.

Implementation Details

The model pairs an auto-regressive LLM backbone (Qwen2-0.5B) with instruction tuning. It requires transformers>=4.37.0 and integrates with the sentence-transformers library. The maximum sequence length is 512 tokens, and asymmetric tasks such as retrieval, reranking, classification, and clustering are handled through instruction prefixes.
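A minimal usage sketch with sentence-transformers follows. The Hugging Face model ID and the trust_remote_code flag are assumptions based on common conventions for this model family; check the HITsz-TMG/KaLM-Embedding repository for the published loading instructions.

```python
from sentence_transformers import SentenceTransformer

# Assumed Hugging Face model ID; verify against the HITsz-TMG/KaLM-Embedding repo.
model = SentenceTransformer(
    "HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1",
    trust_remote_code=True,
)
model.max_seq_length = 512  # the model's maximum supported sequence length

sentences = ["Paris is the capital of France.", "巴黎是法国的首都。"]
embeddings = model.encode(
    sentences,
    batch_size=32,              # batch size is freely tunable
    normalize_embeddings=True,  # unit-length output, so dot product == cosine similarity
)
print(embeddings.shape)  # (2, embedding_dim)
```

Because the embeddings are normalized, downstream nearest-neighbor search can use a plain inner product instead of an explicit cosine computation.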

  • Superior performance on both MTEB (64.74%) and C-MTEB (63.57%) benchmarks
  • Efficient 494M parameter architecture
  • Support for instruction-based embedding generation
  • Optional L2-normalized embedding output

Core Capabilities

  • Multilingual text embedding generation
  • Instruction-tuned task handling
  • Efficient batch processing with customizable batch sizes
  • Support for asymmetric tasks (retrieval, reranking, classification, clustering)
  • Flexible prompt engineering with instruction prefixing (see the sketch below)
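For asymmetric tasks, only the query side is prefixed with a task instruction. The sketch below uses the generic "Instruct: ... \n Query: ..." template common to instruction-tuned embedding models; the exact wording the authors trained with is an assumption here, so consult the model card before relying on it (the prompt keyword also requires a recent sentence-transformers release).

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1",  # assumed model ID
    trust_remote_code=True,
)
model.max_seq_length = 512

# Hypothetical instruction template; check the model card for the exact wording.
query_prompt = "Instruct: Given a query, retrieve passages that answer the query. \n Query: "

queries = ["How many parameters does the mini model have?"]
passages = ["KaLM-embedding-multilingual-mini is built on Qwen2-0.5B and has 494M parameters."]

# Prefix queries only; passages are encoded without an instruction.
query_emb = model.encode(queries, prompt=query_prompt, normalize_embeddings=True)
passage_emb = model.encode(passages, normalize_embeddings=True)

scores = query_emb @ passage_emb.T  # cosine similarity via inner product
print(scores)
```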

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for its superior performance on multilingual benchmarks, outperforming larger models like multilingual-e5-large and bge-m3, while maintaining a relatively compact size of 494M parameters. Its instruction-tuning capability makes it particularly versatile for various embedding tasks.

Q: What are the recommended use cases?

The model is ideal for multilingual applications requiring high-quality text embeddings, including document retrieval, semantic search, text classification, and clustering tasks. It's particularly effective when instruction-based customization is needed for specific use cases.
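As one concrete example, a clustering pipeline can sit directly on top of the embeddings. This sketch assumes scikit-learn is available; the texts and cluster count are illustrative only.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer(
    "HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1",  # assumed model ID
    trust_remote_code=True,
)

texts = [
    "The stock market rallied after the earnings report.",
    "Central banks signaled further interest-rate cuts.",
    "The home team won the championship final.",
    "Ein Stürmer erzielte gestern zwei Tore.",  # multilingual input works the same way
]
embeddings = model.encode(texts, normalize_embeddings=True)

# Group the normalized embeddings into two clusters (finance vs. sports here).
labels = KMeans(n_clusters=2, n_init="auto").fit_predict(embeddings)
print(labels)
```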
