# M3E-Base Model

| Property | Value |
|---|---|
| Parameter Count | 102M |
| Model Type | Text Embedding |
| Architecture | BERT-based |
| Languages | Chinese, English |
| License | Research Only (Non-commercial) |
## What is m3e-base?

M3E-base is a bilingual text embedding model developed by MokaAI that converts Chinese and English text into dense vector representations. It was trained on over 22 million sentence pairs spanning diverse domains, including encyclopedias, finance, healthcare, law, news, and academia.
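As a minimal usage sketch, the snippet below loads the model through the sentence-transformers library, assuming the Hugging Face model id `moka-ai/m3e-base`; the example sentences are illustrative only.

```python
from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub.
model = SentenceTransformer("moka-ai/m3e-base")

sentences = [
    "M3E 是一个中英双语的文本嵌入模型。",  # "M3E is a bilingual Chinese-English text embedding model."
    "M3E is a bilingual Chinese-English text embedding model.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 768): one 768-dimensional vector per sentence
```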
## Implementation Details

Built on a RoBERTa-style pretrained BERT backbone, m3e-base is trained with in-batch negative sampling and a contrastive learning objective (sketched after the list below). Training ran on A100 80G GPUs to allow large batch sizes, which directly increase the number of in-batch negatives, and drew on both large-scale Chinese datasets and 1.45M English triplets from the MEDI dataset.
- 768-dimensional output embeddings
- Trained on 22M+ sentence pairs
- Supports both sentence-to-sentence and sentence-to-passage tasks
- Achieves SOTA performance on Chinese text retrieval tasks
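MokaAI's exact training code is not reproduced here, but in-batch negative contrastive learning typically reduces to an InfoNCE-style loss over a batch similarity matrix, where each query's paired passage is the positive and every other passage in the batch is a negative. The sketch below illustrates that pattern under those assumptions; the function name, temperature value, and tensor shapes are hypothetical, not taken from the model's training code.

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(query_emb: torch.Tensor,
                           passage_emb: torch.Tensor,
                           temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE loss: each query's positive is its paired passage,
    and all other passages in the batch serve as negatives."""
    # Normalize so the dot product equals cosine similarity.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    # (batch, batch) similarity matrix; the diagonal holds the positives.
    logits = q @ p.T / temperature
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)
```

Larger batches enlarge the similarity matrix, giving each query more negatives per step, which is why the training setup emphasizes GPU memory.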
## Core Capabilities
- Bilingual text embedding generation
- Semantic similarity computation
- Text retrieval and search (illustrated in the sketch after this list)
- Document classification
- Question-answer matching
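To illustrate the similarity and retrieval capabilities above, the sketch below embeds a small corpus and answers a query with sentence-transformers' built-in `semantic_search` utility; the corpus sentences and query are made up for illustration.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("moka-ai/m3e-base")

# Toy corpus; replace with your own documents.
corpus = [
    "北京是中国的首都。",  # "Beijing is the capital of China."
    "The stock market fell sharply today.",
    "深度学习需要大量数据。",  # "Deep learning needs large amounts of data."
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("中国的首都是哪里?", convert_to_tensor=True)

# Cosine-similarity search over the corpus; returns the top_k hits per query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```

The same pattern covers question-answer matching: embed questions as queries and candidate answers as the corpus.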
## Frequently Asked Questions
Q: What makes this model unique?
M3E-base stands out for its comprehensive training on massive Chinese-English datasets, superior performance in both semantic similarity and retrieval tasks, and seamless integration with the sentence-transformers ecosystem.
Q: What are the recommended use cases?
The model is ideal for Chinese-focused applications with some English requirements, including semantic search, document classification, and similarity matching. For broader multilingual scenarios beyond Chinese and English, OpenAI's text-embedding-ada-002 may be a better fit.