Multilingual-E5-Small

  • Architecture: 12-layer transformer with 384-dim embeddings
  • Author: intfloat
  • Paper: arXiv:2402.05672
  • Languages Supported: 100+ languages

What is multilingual-e5-small?

Multilingual-E5-Small is a compact yet powerful text embedding model designed for cross-lingual understanding. Built on microsoft/Multilingual-MiniLM-L12-H384, it has been further trained on a diverse collection of multilingual datasets totaling over 5 billion text pairs.

Implementation Details

The model employs a two-stage training approach: weakly supervised contrastive pre-training on data sources including mC4, CC News, and NLLB, followed by supervised fine-tuning on labeled task data. The architecture comprises 12 transformer layers and produces 384-dimensional embeddings.
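For intuition, here is a minimal sketch of an InfoNCE-style contrastive objective with in-batch negatives, the general form used for this kind of pre-training. The function name and temperature value are illustrative assumptions, not the model's exact training configuration.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    # Cosine-similarity logits between every query and every passage in
    # the batch; the i-th passage is the positive for the i-th query,
    # and all other passages serve as in-batch negatives.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)
```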

  • Supports text embedding generation for 100+ languages
  • Optimized for retrieval and semantic search tasks
  • Requires "query:" or "passage:" prefixes for optimal performance (see the usage sketch after this list)
  • Maximum sequence length of 512 tokens
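
A minimal usage sketch with Hugging Face transformers, following the average-pooling and prefixing conventions documented for the E5 family (the example texts are illustrative):

```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Zero out padding tokens, then mean-pool over the sequence dimension.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# Each input needs a "query: " or "passage: " prefix.
input_texts = [
    "query: how much protein should a female eat",
    "passage: The CDC recommends about 46 grams of protein per day for adult women.",
]

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-small")
model = AutoModel.from_pretrained("intfloat/multilingual-e5-small")

batch = tokenizer(input_texts, max_length=512, padding=True,
                  truncation=True, return_tensors="pt")
outputs = model(**batch)
embeddings = average_pool(outputs.last_hidden_state, batch["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)  # unit-norm 384-dim vectors

# With normalized vectors, the dot product is the cosine similarity.
score = embeddings[0] @ embeddings[1]
print(score.item())
```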

Core Capabilities

  • Cross-lingual semantic search and retrieval
  • Document similarity analysis (see the sketch after this list)
  • Multilingual question answering
  • Text classification and clustering
  • Demonstrates strong performance on the Mr. TyDi benchmark (64.4 average MRR@10)
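
Because all supported languages share a single embedding space, translations of the same sentence map to nearby vectors. A short sketch using the sentence-transformers library (example sentences are illustrative; per the E5 usage notes, symmetric similarity tasks use the "query:" prefix on both sides):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")

sentences = [
    "query: The weather is lovely today.",
    "query: Il fait très beau aujourd'hui.",  # French
    "query: 今日はとても良い天気です。",          # Japanese
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Pairwise cosine similarities (embeddings are already unit-normalized);
# cross-lingual paraphrases should score close to each other.
similarities = embeddings @ embeddings.T
print(similarities)
```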

Frequently Asked Questions

Q: What makes this model unique?

Its two-stage training recipe, combining weakly supervised contrastive pre-training over diverse multilingual sources with supervised fine-tuning, together with a compact 12-layer, 384-dimensional architecture, makes it particularly effective for cross-lingual applications while keeping model size and inference cost low.

Q: What are the recommended use cases?

The model excels in cross-lingual information retrieval, semantic search, and text similarity tasks. It's particularly useful for applications requiring multilingual understanding with limited computational resources.
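
As a sketch of such a low-resource setup, brute-force nearest-neighbor search over normalized embeddings is often sufficient at modest corpus sizes; the corpus and helper function below are hypothetical:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")

# Hypothetical document collection; documents take the "passage: " prefix.
corpus = [
    "passage: E5 embeddings map text into a shared multilingual vector space.",
    "passage: Der Eiffelturm steht in Paris.",
    "passage: Embeddings can power recommendation and deduplication systems.",
]
doc_emb = model.encode(corpus, normalize_embeddings=True)

def search(query: str, top_k: int = 2) -> list[tuple[float, str]]:
    # Queries take the "query: " prefix; the dot product equals cosine
    # similarity because all embeddings are unit-normalized.
    q = model.encode(f"query: {query}", normalize_embeddings=True)
    scores = doc_emb @ q
    best = np.argsort(-scores)[:top_k]
    return [(float(scores[i]), corpus[i]) for i in best]

# The German passage should rank first despite the English query.
print(search("Where is the Eiffel Tower?"))
```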
