Multilingual-E5-Large

Maintained by: intfloat

  • Parameter Count: 560M
  • Paper: Multilingual E5 Text Embeddings: A Technical Report
  • License: MIT
  • Languages Supported: 94 languages

What is multilingual-e5-large?

Multilingual-E5-Large is a state-of-the-art text embedding model that supports 94 languages and excels at semantic search, retrieval, and text-similarity tasks. Built on the XLM-RoBERTa architecture with 24 layers and an embedding size of 1024, it was trained in two stages: contrastive pre-training on weakly supervised text pairs followed by supervised fine-tuning on diverse multilingual datasets.
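
The embedding dimensionality and pooling behaviour can be seen from a short Transformers snippet. This is a minimal sketch following the average-pooling recipe from the official model card; the query/passage pair is purely illustrative.

```python
# Minimal sketch: embeddings via Hugging Face Transformers with average pooling,
# as described in the official model card. Inputs carry the required prefixes.
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoModel, AutoTokenizer


def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then average the remaining token states.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")
model = AutoModel.from_pretrained("intfloat/multilingual-e5-large")

input_texts = [
    "query: how much protein should a female eat",
    "passage: As a general guideline, adult women need roughly 46 grams of protein per day.",
]
batch = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch)
embeddings = average_pool(outputs.last_hidden_state, batch["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)  # L2-normalize for cosine comparisons
print(embeddings.shape)  # torch.Size([2, 1024]) -- one 1024-dimensional vector per input
```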

Implementation Details

The model combines weakly supervised contrastive pre-training on more than one billion text pairs with supervised fine-tuning on high-quality datasets across multiple languages. For best performance, every input must be prefixed with "query: " or "passage: ", and the model integrates with popular frameworks such as PyTorch and Sentence Transformers (see the sketch after the list below).

  • Trained on massive multilingual datasets including mC4, CC News, NLLB, and Wikipedia
  • Fine-tuned on diverse tasks including MS MARCO, NQ, TriviaQA, and multilingual retrieval datasets
  • Achieves state-of-the-art performance on the Mr. TyDi benchmark with an average MRR@10 of 70.5
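
As a concrete illustration of the prefixing requirement noted above, the sketch below scores a query against two passages with Sentence Transformers; the example texts are illustrative placeholders.

```python
# Minimal sketch: "query: " / "passage: " prefixing with Sentence Transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

queries = ["query: how much protein should a female eat"]
passages = [
    "passage: As a general guideline, adult women need roughly 46 grams of protein per day.",
    "passage: The Eiffel Tower was completed in 1889 and stands about 330 metres tall.",
]

# normalize_embeddings=True makes dot products equal to cosine similarities.
query_emb = model.encode(queries, normalize_embeddings=True)
passage_emb = model.encode(passages, normalize_embeddings=True)

scores = query_emb @ passage_emb.T
print(scores)  # the protein passage should score clearly higher than the unrelated one
```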

Core Capabilities

  • Text embedding generation for 94 languages
  • Semantic search and information retrieval
  • Cross-lingual text similarity assessment (see the sketch after this list)
  • Document clustering and classification
  • Bitext mining and parallel text alignment
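
As a small illustration of the cross-lingual similarity capability, the sketch below embeds the same question in several languages; the model card recommends the "query: " prefix for symmetric similarity tasks, and the example sentences are illustrative.

```python
# Minimal sketch: cross-lingual similarity in the shared embedding space.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

sentences = [
    "query: What is the capital of France?",
    "query: Quelle est la capitale de la France ?",   # French translation
    "query: 法国的首都是哪里？",                        # Chinese translation
    "query: What is the tallest mountain on Earth?",  # unrelated English question
]

emb = model.encode(sentences, normalize_embeddings=True)
sims = emb @ emb.T
print(sims[0])  # the translations score far higher against sentence 0 than the unrelated question
```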

Frequently Asked Questions

Q: What makes this model unique?

The model combines broad multilingual coverage with state-of-the-art performance across a wide range of tasks. Its two-stage training process (contrastive pre-training followed by supervised fine-tuning) and its explicit "query:"/"passage:" prefixing scheme make it particularly effective for real-world applications.

Q: What are the recommended use cases?

The model excels at cross-lingual information retrieval, semantic search, and text similarity tasks. It's particularly suitable for applications requiring multilingual understanding and can be used for clustering, classification, and parallel text mining.
