KaLM-embedding-multilingual-mini-instruct-v1.5
| Property | Value |
|---|---|
| Model Size | 494M parameters |
| Base Architecture | Qwen2-0.5B |
| MTEB Score | 64.94 |
| C-MTEB Score | 64.13 |
| Author | HIT-TMG |
What is KaLM-embedding-multilingual-mini-instruct-v1.5?
KaLM-embedding-multilingual-mini-instruct-v1.5 is the latest iteration in the KaLM-Embedding series of multilingual embedding models. Built on Qwen2-0.5B, it was trained with weakly-supervised pre-training followed by supervised fine-tuning, achieving state-of-the-art performance on multilingual embedding tasks.
Implementation Details
The model uses the transformers library (version ≥ 4.37.0 required) and integrates seamlessly with the sentence-transformers framework. It supports a maximum sequence length of 512 tokens and includes special instruction handling for asymmetric tasks such as retrieval; a usage sketch follows the feature list below.
- Generates normalized embeddings
- Handles batch processing with customizable batch sizes
- Supports instruction-based prompting for task-specific embeddings
- Compatible with retrieval, reranking, classification, and clustering tasks
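A minimal sentence-transformers sketch, assuming the checkpoint is hosted on the Hugging Face Hub under the author's namespace as HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5; the example sentences and batch size are illustrative:

```python
# Minimal usage sketch; assumes sentence-transformers is installed alongside
# transformers >= 4.37.0, and that the repo id below matches the published model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5"
)
model.max_seq_length = 512  # the model's maximum sequence length

sentences = ["How is the weather today?", "今天天气怎么样?"]
embeddings = model.encode(
    sentences,
    normalize_embeddings=True,  # unit-length vectors, so dot product == cosine
    batch_size=32,              # adjust to available memory
)
print(embeddings.shape)  # (num_sentences, embedding_dim)
```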
Core Capabilities
- Superior performance on MTEB (64.94) and C-MTEB (64.13) benchmarks
- Multilingual text embedding generation
- Instruction-tuned for various NLP tasks
- Efficient processing with batch support
- Normalized embedding output, so dot products equal cosine similarity (see the retrieval sketch below)
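For asymmetric tasks such as retrieval, an instruction prefix is prepended to queries while passages are encoded as-is. The "Instruct: ... \n Query: " template below is an assumption borrowed from the e5/KaLM convention; verify the exact prompt format against the official model card before relying on it:

```python
# Hedged sketch of instruction-prompted retrieval; the prompt template is an
# assumption -- check it against the official model card.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5")

task = "Given a query, retrieve passages that answer the query"
queries = [f"Instruct: {task} \n Query: what is a multilingual embedding model?"]
passages = [
    "Multilingual embedding models map text from many languages into one vector space.",
    "The weather in Harbin is cold in winter.",
]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

# With normalized embeddings, the dot product is cosine similarity.
scores = q_emb @ p_emb.T
print(scores)  # higher score = more relevant passage
```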
Frequently Asked Questions
Q: What makes this model unique?
This model stands out due to its superior training data quality and comprehensive instruction tuning, achieving better performance than comparable models like multilingual-e5-large and bge-m3 on standard benchmarks while maintaining a relatively compact size.
Q: What are the recommended use cases?
The model excels in multilingual applications including text retrieval, semantic search, document classification, and clustering (see the sketch below). It is particularly effective for tasks that benefit from instruction-based prompting rather than task-specific fine-tuning.
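As an illustration of the clustering use case, the embeddings can be fed directly into a standard clusterer such as scikit-learn's KMeans. The documents and cluster count below are invented for the example:

```python
# Illustrative clustering sketch; documents and n_clusters are made up, and
# scikit-learn (>= 1.2 for n_init="auto") is assumed to be installed.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5")

docs = [
    "The stock market fell sharply today.",
    "Central banks raised interest rates again.",
    "The new phone has an improved camera.",
    "This laptop ships with a faster processor.",
]

embeddings = model.encode(docs, normalize_embeddings=True)
labels = KMeans(n_clusters=2, n_init="auto").fit_predict(embeddings)
print(labels)  # e.g. [0, 0, 1, 1]: finance vs. consumer tech
```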