m3e-large

Maintained by: moka-ai

M3E-Large Text Embedding Model

Property             Value
-------------------  ------------------------------
Parameter Count      340M
Embedding Dimension  768
Languages            Chinese, English
License              Research Only (Non-commercial)

What is m3e-large?

M3E-large is the largest variant of the Moka Massive Mixed Embedding (M3E) model family, with 340M parameters. Trained on over 22 million sentence pairs, it is designed for Chinese and English text processing and excels at generating high-quality embeddings for similarity comparison and retrieval tasks.

Implementation Details

Built on the sentence-transformers framework and distributed through HuggingFace, m3e-large uses a RoBERTa-based architecture fine-tuned with in-batch negative sampling and contrastive learning. The model achieves strong results on Chinese NLP benchmarks, including a 0.6231 average accuracy on text classification tasks.
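
With in-batch negatives, each sentence is paired with its positive and every other pair in the batch serves as a negative. The following is a schematic sketch of that contrastive loss, not moka-ai's actual training code; the temperature value is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchor_embs, positive_embs, temperature=0.05):
    """anchor_embs, positive_embs: (batch, dim) embeddings of paired sentences."""
    anchor = F.normalize(anchor_embs, dim=-1)
    positive = F.normalize(positive_embs, dim=-1)
    # (batch, batch) cosine similarities: the diagonal holds each pair's
    # positive score; off-diagonal entries act as in-batch negatives.
    logits = anchor @ positive.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```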

  • Trained on 22M+ diverse sentence pairs covering encyclopedic, financial, medical, legal, and academic domains
  • Optimized for both sentence-to-sentence (S2S) and sentence-to-passage (S2P) tasks
  • Trained with efficient in-batch negative sampling on A100 80GB GPUs
  • Fully compatible with the sentence-transformers ecosystem (see the usage sketch after this list)
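
A minimal usage sketch via sentence-transformers; the example sentences are placeholders:

```python
from sentence_transformers import SentenceTransformer

# Load the model from the HuggingFace Hub by its repository name.
model = SentenceTransformer("moka-ai/m3e-large")

sentences = [
    "M3E 是一个中英双语的文本嵌入模型",
    "M3E is a bilingual Chinese-English text embedding model",
]
embeddings = model.encode(sentences)  # one embedding row per sentence
print(embeddings.shape)
```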

Core Capabilities

  • High-performance text similarity computation (0.6231 average accuracy on classification tasks)
  • Robust text retrieval (0.7974 NDCG@10 on the T2Ranking benchmark); see the retrieval sketch after this list
  • Multilingual support for Chinese and English
  • Seamless integration with popular frameworks like Chroma and Semantic Kernel
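
The retrieval capability can be exercised directly with sentence-transformers' cosine-similarity utility. A small sketch, with an illustrative query and passages:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("moka-ai/m3e-large")

query = "如何申请退款？"  # "How do I request a refund?"
passages = [
    "退款申请可以在订单页面提交，通常三个工作日内处理。",  # refund instructions
    "我们的营业时间为每天上午九点到下午六点。",  # business hours
]

# Encode to tensors so util.cos_sim can score the query against every passage.
q_emb = model.encode(query, convert_to_tensor=True)
p_embs = model.encode(passages, convert_to_tensor=True)
scores = util.cos_sim(q_emb, p_embs)[0]

# Print passages from most to least similar.
for idx in scores.argsort(descending=True):
    print(f"{scores[idx].item():.4f}  {passages[idx]}")
```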

Frequently Asked Questions

Q: What makes this model unique?

M3E-large stands out for its large training corpus of 22M+ sentence pairs and strong performance on Chinese NLP tasks. It balances model size against performance well, outperforming competitors such as OpenAI's text-embedding-ada-002 on several Chinese benchmarks.

Q: What are the recommended use cases?

The model is ideal for text similarity comparison, document retrieval systems, duplicate question detection, text classification, and building memory modules for GPT-style applications. It is particularly well suited to applications requiring strong Chinese language understanding; a vector-store integration sketch follows below.
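
As a sketch of the vector-store integration mentioned under Core Capabilities, the snippet below wires m3e-large into a Chroma collection. The chromadb calls shown exist in recent releases but may vary by version, and the documents and ids are illustrative placeholders.

```python
import chromadb
from chromadb.utils import embedding_functions

# Wrap m3e-large as a Chroma embedding function backed by sentence-transformers.
m3e = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="moka-ai/m3e-large"
)

client = chromadb.Client()  # in-memory; use a persistent client for disk storage
collection = client.create_collection(name="docs", embedding_function=m3e)

collection.add(
    ids=["faq-1", "faq-2"],
    documents=[
        "退款申请可以在订单页面提交。",
        "营业时间为每天上午九点到下午六点。",
    ],
)

results = collection.query(query_texts=["怎么退款"], n_results=1)
print(results["documents"][0])  # best-matching document for the query
```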
