Multilingual-E5-Small
| Property | Value |
|---|---|
| Architecture | 12-layer transformer with 384-dim embeddings |
| Author | intfloat |
| Paper | arXiv:2402.05672 |
| Languages Supported | 100+ languages |
What is multilingual-e5-small?
Multilingual-E5-Small is a compact yet powerful text embedding model designed for cross-lingual understanding. It is initialized from microsoft/Multilingual-MiniLM-L12-H384 and has been extensively trained on a diverse collection of multilingual datasets totaling over 5 billion text pairs.
Implementation Details
The model employs a two-stage training approach: first, contrastive pre-training with weak supervision across multiple data sources including mC4, CC News, and NLLB, followed by supervised fine-tuning on specific tasks. The architecture features 12 transformer layers and produces 384-dimensional embeddings.
- Supports text embedding generation for 100+ languages
- Optimized for retrieval and semantic search tasks
- Requires "query:" or "passage:" prefixes for optimal performance
- Maximum sequence length of 512 tokens
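A minimal usage sketch of the prefix convention, assuming the sentence-transformers package is installed and the intfloat/multilingual-e5-small checkpoint is used; the example texts are illustrative only:

```python
from sentence_transformers import SentenceTransformer

# Load the 384-dim multilingual E5 model.
model = SentenceTransformer("intfloat/multilingual-e5-small")

# E5 models expect asymmetric prefixes: "query: " for search queries,
# "passage: " for the documents being searched.
queries = ["query: how tall is the Eiffel Tower"]
passages = [
    "passage: The Eiffel Tower is about 330 metres tall.",
    "passage: Bananas are an excellent source of potassium.",
]

# Normalized embeddings, so the dot product equals cosine similarity.
query_emb = model.encode(queries, normalize_embeddings=True)
passage_emb = model.encode(passages, normalize_embeddings=True)

scores = query_emb @ passage_emb.T
print(scores)  # higher score = more relevant passage
```

Omitting the prefixes tends to degrade retrieval quality, since the model was trained with them in place.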
Core Capabilities
- Cross-lingual semantic search and retrieval (see the retrieval sketch after this list)
- Document similarity analysis
- Multilingual question answering
- Text classification and clustering
- Demonstrates strong performance on the Mr. TyDi benchmark, with 64.4% average MRR@10
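As a sketch of cross-lingual retrieval with the plain Hugging Face transformers API, using the mean-pooling recipe commonly applied to E5 models; the multilingual example texts are illustrative assumptions:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states, attention_mask):
    # Zero out padding positions, then mean-pool over the sequence dimension.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-small")
model = AutoModel.from_pretrained("intfloat/multilingual-e5-small")

# A German query ranked against passages written in other languages.
texts = [
    "query: Wie hoch ist der Eiffelturm?",
    "passage: The Eiffel Tower is about 330 metres tall.",
    "passage: La Torre Eiffel fue construida para la Exposición Universal de 1889.",
    "passage: Bananas are an excellent source of potassium.",
]

batch = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)

embeddings = average_pool(outputs.last_hidden_state, batch["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)

# Cosine similarity of the query (row 0) against each passage.
scores = embeddings[:1] @ embeddings[1:].T
print(scores.tolist())
```

Because all languages share the same embedding space, the same code handles monolingual and cross-lingual retrieval without any language-specific configuration.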
Frequently Asked Questions
Q: What makes this model unique?
The model's extensive multilingual training on diverse datasets and its efficient architecture make it particularly effective for cross-lingual applications while maintaining a relatively small size.
Q: What are the recommended use cases?
The model excels in cross-lingual information retrieval, semantic search, and text similarity tasks. It's particularly useful for applications requiring multilingual understanding with limited computational resources.