m3e-base

Maintained by: moka-ai

M3E-Base Model

  • Parameter Count: 102M
  • Model Type: Text Embedding
  • Architecture: BERT-based
  • Languages: Chinese, English
  • License: Research Only (Non-commercial)

What is m3e-base?

M3E-base (Moka Massive Mixed Embedding) is a bilingual embedding model developed by MokaAI that converts text into dense vector representations. It handles both Chinese and English and was trained on over 22 million sentence pairs spanning diverse domains, including encyclopedias, finance, healthcare, law, news, and academia.

Implementation Details

Built on the RoBERTa architecture, m3e-base is trained with contrastive learning using in-batch negative sampling. Training ran on A100 80GB GPUs to allow large batch sizes, which directly increase the number of in-batch negatives available per step, and combined large-scale Chinese datasets with 1.45M English triplets from the MEDI dataset.

  • 768-dimensional output embeddings (see the usage sketch after this list)
  • Trained on 22M+ sentence pairs
  • Supports both sentence-to-sentence and sentence-to-passage tasks
  • Achieves SOTA performance on Chinese text retrieval tasks
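
As a concrete illustration, here is a minimal usage sketch with the sentence-transformers library. The checkpoint ID moka-ai/m3e-base is the published Hugging Face model; the example sentences are purely illustrative:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Load the published checkpoint from the Hugging Face Hub
model = SentenceTransformer("moka-ai/m3e-base")

sentences = [
    "M3E 是一个中英双语的文本嵌入模型",                          # Chinese
    "M3E is a bilingual Chinese-English text embedding model",  # English
]

# Encode both sentences into dense vectors
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 768) -- the 768-dimensional output noted above
```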

Core Capabilities

  • Bilingual text embedding generation
  • Semantic similarity computation
  • Text retrieval and search (see the retrieval sketch after this list)
  • Document classification
  • Question-answer matching
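
The retrieval and similarity capabilities above can be combined in a simple sentence-to-passage ranking loop, sketched below; the query and passages are hypothetical examples, and scoring uses cosine similarity via the library's util helpers:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("moka-ai/m3e-base")

query = "如何申请信用卡?"  # "How do I apply for a credit card?"
passages = [
    "申请信用卡需要提交身份证明和收入证明。",  # relevant: credit-card application
    "今天的天气晴朗,适合户外运动。",          # irrelevant: the weather
]

# Encode the query and candidate passages as tensors
query_emb = model.encode(query, convert_to_tensor=True)
passage_embs = model.encode(passages, convert_to_tensor=True)

# Cosine similarity between the query and each passage, highest first
scores = util.cos_sim(query_emb, passage_embs)[0]
for passage, score in sorted(zip(passages, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {passage}")
```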

Frequently Asked Questions

Q: What makes this model unique?

M3E-base stands out for its training on large-scale Chinese and English datasets, strong performance on both semantic similarity and retrieval tasks, and drop-in compatibility with the sentence-transformers ecosystem.

Q: What are the recommended use cases?

The model is ideal for Chinese-focused applications with some English requirements, including semantic search, document classification, and similarity matching. For broadly multilingual scenarios beyond Chinese and English, OpenAI's text-embedding-ada-002 may be a better fit.
