M3E-Large Text Embedding Model
| Property | Value |
| --- | --- |
| Parameter Count | 340M |
| Embedding Dimension | 768 |
| Languages | Chinese, English |
| License | Research Only (Non-commercial) |
What is m3e-large?
M3E-large is the largest variant of the Moka Massive Mixed Embedding (M3E) model family, with 340M parameters. It is a text embedding model trained on over 22 million sentence pairs and designed specifically for Chinese and English text. The model produces embeddings well suited to similarity comparison and retrieval tasks.
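As a quick illustration, here is a minimal encoding sketch using the sentence-transformers library; the model identifier `moka-ai/m3e-large` is the commonly published HuggingFace ID and is assumed here:

```python
# Minimal sketch: encode bilingual sentences with m3e-large.
# Assumes sentence-transformers is installed and the model is available
# under the HuggingFace ID "moka-ai/m3e-large".
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("moka-ai/m3e-large")

sentences = [
    "M3E 是一个中英双语的文本嵌入模型",
    "M3E is a bilingual Chinese-English text embedding model",
]
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # (2, embedding dimension)
```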
Implementation Details
Built on the HuggingFace Sentence-Transformers framework, m3e-large uses a RoBERTa-based architecture fine-tuned with in-batch negative sampling and contrastive learning. It achieves strong results on Chinese NLP benchmarks, including an average accuracy of 0.6231 on text classification tasks.
- Trained on 22M+ diverse sentence pairs covering encyclopedic, financial, medical, legal, and academic domains
- Optimized for both sentence-to-sentence (S2S) and sentence-to-passage (S2P) tasks
- Trained with efficient in-batch negative sampling on A100 80GB GPUs (see the loss sketch after this list)
- Fully compatible with the sentence-transformers ecosystem
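A rough sketch of the in-batch negative, contrastive objective mentioned above (an InfoNCE-style loss); this illustrates the technique, not the authors' exact training code:

```python
# Illustration of in-batch negative contrastive learning (InfoNCE-style).
# Each query's positive is the same-index passage; every other passage in
# the batch serves as a negative. Sketch only, not m3e's training code.
import torch
import torch.nn.functional as F

def in_batch_negative_loss(query_emb: torch.Tensor,
                           pos_emb: torch.Tensor,
                           temperature: float = 0.05) -> torch.Tensor:
    # query_emb, pos_emb: (batch, dim) L2-normalized embeddings
    logits = query_emb @ pos_emb.T / temperature  # (batch, batch) similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random normalized vectors (dimension chosen arbitrarily)
q = F.normalize(torch.randn(8, 768), dim=-1)
p = F.normalize(torch.randn(8, 768), dim=-1)
print(in_batch_negative_loss(q, p))
```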
Core Capabilities
- High-performance text similarity computation (0.6231 accuracy on classification tasks)
- Robust text retrieval functionality (0.7974 NDCG@10 on T2Ranking)
- Multilingual support for Chinese and English
- Seamless integration with popular frameworks such as Chroma and Semantic Kernel (see the Chroma sketch after this list)
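As a sketch of the Chroma integration mentioned above, the snippet below uses m3e-large as a collection's embedding function; collection and document contents are illustrative, and it assumes a chromadb version that ships `SentenceTransformerEmbeddingFunction`:

```python
# Sketch: using m3e-large as the embedding function for a Chroma collection.
# Assumes chromadb and sentence-transformers are installed; names are illustrative.
import chromadb
from chromadb.utils import embedding_functions

m3e_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="moka-ai/m3e-large"
)

client = chromadb.Client()  # in-memory client
collection = client.create_collection(name="docs_zh", embedding_function=m3e_ef)

collection.add(
    documents=["M3E 支持中英双语检索", "Embeddings map text to vectors"],
    ids=["doc1", "doc2"],
)
results = collection.query(query_texts=["哪个模型支持中文检索"], n_results=1)
print(results["documents"])
```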
Frequently Asked Questions
Q: What makes this model unique?
M3E-large stands out for its large training corpus of more than 22 million sentence pairs and its strong performance on Chinese NLP tasks. It offers a good balance between model size and performance, outperforming OpenAI's text-embedding-ada-002 on several Chinese benchmarks.
Q: What are the recommended use cases?
The model is ideal for: text similarity comparison, document retrieval systems, duplicate question detection, text classification, and building GPT memory modules. It's particularly well-suited for applications requiring strong Chinese language understanding.
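For example, duplicate question detection can be sketched as a cosine-similarity threshold check; the 0.85 threshold below is illustrative only and should be tuned on your own data:

```python
# Sketch: flagging duplicate questions with a cosine-similarity threshold.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("moka-ai/m3e-large")

q1 = "怎么重置账户密码?"
q2 = "忘记密码了要如何找回?"

emb = model.encode([q1, q2], normalize_embeddings=True)
score = util.cos_sim(emb[0], emb[1]).item()
# 0.85 is an illustrative threshold, not a value from the model card
print("duplicate" if score > 0.85 else "distinct", f"(score={score:.3f})")
```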