Multilingual-E5-Large
| Property | Value |
|---|---|
| Parameter Count | 560M |
| Paper | Multilingual E5 Text Embeddings: A Technical Report |
| License | MIT |
| Languages Supported | 94 languages |
What is multilingual-e5-large?
Multilingual-E5-Large is a state-of-the-art text embedding model that supports 94 languages and excels at tasks such as semantic search, retrieval, and text similarity. Built on the XLM-RoBERTa architecture, with 24 layers and an embedding dimension of 1024, it was trained in two stages: contrastive pre-training on weakly supervised text pairs followed by supervised fine-tuning on diverse multilingual datasets.
Implementation Details
Training combines weak supervision on more than 1B text pairs with supervised fine-tuning on high-quality datasets spanning multiple languages. For optimal performance, every input must carry a "query: " or "passage: " prefix (see the sketch after the list below), and the model integrates with popular frameworks such as PyTorch and Sentence Transformers.
- Pre-trained on large multilingual corpora including mC4, CC News, NLLB, and Wikipedia
- Fine-tuned on diverse labeled tasks including MS MARCO, NQ, TriviaQA, and multilingual retrieval datasets
- Achieves state-of-the-art performance on the Mr. TyDi benchmark with an average MRR@10 of 70.5
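As a concrete illustration of the prefixing convention, here is a minimal sketch using the Sentence Transformers library; the Hub model ID intfloat/multilingual-e5-large and the example sentences are assumptions for illustration, not part of this text:

```python
# Minimal sketch of the documented "query:"/"passage:" prefixing convention.
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

# Queries and passages must each carry their respective prefix.
queries = ["query: how much protein should a female eat"]
passages = [
    "passage: As a general guideline, the CDC's average requirement of protein "
    "for women ages 19 to 70 is 46 grams per day."
]

# normalize_embeddings=True yields unit vectors, so a dot product
# is equivalent to cosine similarity.
query_embeddings = model.encode(queries, normalize_embeddings=True)
passage_embeddings = model.encode(passages, normalize_embeddings=True)

scores = query_embeddings @ passage_embeddings.T
print(scores)
```

Omitting the prefixes does not raise an error, but it degrades retrieval quality, since the model was trained with them throughout.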
Core Capabilities
- Text embedding generation for 94 languages
- Semantic search and information retrieval
- Cross-lingual text similarity assessment
- Document clustering and classification
- Bitext mining and parallel text alignment
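Most of these capabilities reduce to comparing normalized embeddings. Below is a hedged sketch of cross-lingual similarity scoring with plain PyTorch and Hugging Face Transformers, using average pooling over the last hidden states (the pooling scheme commonly used with E5 models); the model ID and example texts are assumptions:

```python
# Cross-lingual similarity: one English query scored against passages
# in several languages.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

def average_pool(last_hidden_states, attention_mask):
    # Zero out padding positions, then average over the sequence dimension.
    masked = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return masked.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")
model = AutoModel.from_pretrained("intfloat/multilingual-e5-large")

texts = [
    "query: capital of France",
    "passage: Paris is the capital and most populous city of France.",
    "passage: París es la capital de Francia.",  # Spanish
    "passage: 巴黎是法国的首都。",  # Chinese
]

batch = tokenizer(texts, max_length=512, padding=True,
                  truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)

embeddings = average_pool(outputs.last_hidden_state, batch["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)  # unit length: dot = cosine

# Similarity of the query (row 0) to each passage.
scores = embeddings[0] @ embeddings[1:].T
print(scores.tolist())
```

Because all texts share one embedding space, the same scoring loop serves monolingual and cross-lingual retrieval alike.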
Frequently Asked Questions
Q: What makes this model unique?
The model combines broad multilingual coverage with state-of-the-art performance across a range of tasks. Its two-stage training process and consistent use of input prefixes make it particularly effective for real-world applications.
Q: What are the recommended use cases?
The model excels at cross-lingual information retrieval, semantic search, and text similarity. It is particularly suitable for applications requiring multilingual understanding, and its embeddings can also drive clustering, classification, and parallel text mining, as in the sketch below.
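As one illustration of the clustering use case, here is a minimal, hypothetical sketch that groups short multilingual snippets by topic with K-Means over normalized embeddings; scikit-learn, the model ID, and the sample snippets are assumptions for illustration:

```python
# Hypothetical clustering sketch: group multilingual snippets by topic.
# Requires: pip install sentence-transformers scikit-learn
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("intfloat/multilingual-e5-large")

# For symmetric tasks such as clustering, prefix every input with "query: ".
docs = [
    "query: The stock market rallied after the rate decision.",
    "query: Die Aktienmärkte stiegen nach dem Zinsentscheid.",  # German
    "query: The team won the championship final last night.",
    "query: L'équipe a remporté la finale du championnat hier soir.",  # French
]

embeddings = model.encode(docs, normalize_embeddings=True)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)  # finance snippets and sports snippets should share labels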