# M3E-Base Model

| Property | Value |
|---|---|
| Parameter Count | 102M |
| Model Type | Text Embedding |
| Architecture | BERT-based |
| Languages | Chinese, English |
| License | Research Only (Non-commercial) |
## What is m3e-base?

M3E-base is a bilingual text embedding model developed by MokaAI that converts Chinese and English text into dense vector representations. It was trained on over 22 million sentence pairs spanning diverse domains, including encyclopedias, finance, healthcare, law, news, and academia.
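As a minimal usage sketch, the snippet below loads the model through the sentence-transformers library, assuming the Hugging Face model id `moka-ai/m3e-base`; the example sentences are illustrative only.

```python
from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub.
model = SentenceTransformer("moka-ai/m3e-base")

sentences = [
    "M3E 是一个中英双语的文本嵌入模型。",  # "M3E is a bilingual Chinese-English text embedding model."
    "M3E is a bilingual Chinese-English text embedding model.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 768): one 768-dimensional vector per sentence
```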
## Implementation Details

Built on a RoBERTa-style pretrained BERT backbone, m3e-base is trained with in-batch negative sampling and a contrastive learning objective (sketched after the list below). Training ran on A100 80G GPUs to allow large batch sizes, which directly increase the number of in-batch negatives, and drew on both large-scale Chinese datasets and 1.45M English triplets from the MEDI dataset.
- 768-dimensional output embeddings
- Trained on 22M+ sentence pairs
- Supports both sentence-to-sentence and sentence-to-passage tasks
- Achieves SOTA performance on Chinese text retrieval tasks
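MokaAI's exact training code is not reproduced here, but in-batch negative contrastive learning typically reduces to an InfoNCE-style loss over a batch similarity matrix, where each query's paired passage is the positive and every other passage in the batch is a negative. The sketch below illustrates that pattern under those assumptions; the function name, temperature value, and tensor shapes are hypothetical, not taken from the model's training code.

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(query_emb: torch.Tensor,
                           passage_emb: torch.Tensor,
                           temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE loss: each query's positive is its paired passage,
    and all other passages in the batch serve as negatives."""
    # Normalize so the dot product equals cosine similarity.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    # (batch, batch) similarity matrix; the diagonal holds the positives.
    logits = q @ p.T / temperature
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)
```

Larger batches enlarge the similarity matrix, giving each query more negatives per step, which is why the training setup emphasizes GPU memory.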
## Core Capabilities
- Bilingual text embedding generation
- Semantic similarity computation
- Text retrieval and search (illustrated in the sketch after this list)
- Document classification
- Question-answer matching
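To illustrate the similarity and retrieval capabilities above, the sketch below embeds a small corpus and answers a query with sentence-transformers' built-in `semantic_search` utility; the corpus sentences and query are made up for illustration.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("moka-ai/m3e-base")

# Toy corpus; replace with your own documents.
corpus = [
    "北京是中国的首都。",  # "Beijing is the capital of China."
    "The stock market fell sharply today.",
    "深度学习需要大量数据。",  # "Deep learning needs large amounts of data."
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("中国的首都是哪里?", convert_to_tensor=True)

# Cosine-similarity search over the corpus; returns the top_k hits per query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```

The same pattern covers question-answer matching: embed questions as queries and candidate answers as the corpus.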
## Frequently Asked Questions
Q: What makes this model unique?
M3E-base stands out for its comprehensive training on massive Chinese-English datasets, superior performance in both semantic similarity and retrieval tasks, and seamless integration with the sentence-transformers ecosystem.
Q: What are the recommended use cases?
The model is ideal for Chinese-focused applications with some English requirements, including semantic search, document classification, and similarity matching. For broader multilingual scenarios beyond Chinese and English, OpenAI's text-embedding-ada-002 may be a better fit.