MiniCPM-Embedding
Property | Value |
---|---|
Parameter Count | 2.4B |
Embedding Dimension | 2304 |
Max Input Tokens | 512 |
License | Apache-2.0 (code), Custom MiniCPM License (weights) |
Authors | ModelBest Inc., THUNLP, NEUIR |
What is MiniCPM-Embedding?
MiniCPM-Embedding is a bilingual (Chinese-English) text embedding model developed jointly by ModelBest Inc., the Tsinghua University NLP Lab (THUNLP), and the Northeastern University IR Group (NEUIR). Built on MiniCPM-2B-sft-bf16, it targets both monolingual and cross-lingual text retrieval over Chinese and English content.
Implementation Details
The model combines bidirectional attention with Weighted Mean Pooling to produce a single vector per input. Training was multi-stage and used approximately 6 million examples drawn from open-source, synthetic, and proprietary data. The model outputs 2304-dimensional embeddings and accepts up to 512 tokens per input.
- Bidirectional attention architecture for comprehensive text understanding
- Weighted Mean Pooling for effective representation learning
- Multi-stage training approach with diverse data sources
- Support for instruction-based querying
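The pooling step can be illustrated with a minimal sketch. It assumes one common form of weighted mean pooling, in which each real token's weight grows linearly with its position and padding tokens get weight zero; the exact weighting used by MiniCPM-Embedding should be checked against the model's published code. The function name and the plain-list representation are illustrative only.

```python
from typing import List, Sequence


def weighted_mean_pooling(hidden: Sequence[Sequence[float]],
                          mask: Sequence[int]) -> List[float]:
    """Pool per-token vectors into a single embedding.

    Assumption: the i-th real token gets weight i (1-based), so later
    tokens count more; padding positions (mask == 0) get weight 0.
    """
    weights = []
    seen = 0
    for m in mask:
        seen += m                 # running count of real tokens so far
        weights.append(seen * m)  # zero out padding positions
    total = sum(weights)
    if total == 0:
        raise ValueError("attention mask contains no real tokens")
    dim = len(hidden[0])
    return [
        sum(w * token[i] for w, token in zip(weights, hidden)) / total
        for i in range(dim)
    ]


# Toy input: two real tokens followed by one padding token.
tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
pooled = weighted_mean_pooling(tokens, mask=[1, 1, 0])
print(pooled)  # weights are 1, 2, 0, so the second token dominates
```

In a real model the per-token vectors would be the 2304-dimensional outputs of the final layer, and the mask would come from the tokenizer.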
Core Capabilities
- Strong Chinese and English monolingual retrieval (76.76 NDCG@10 on the Chinese C-MTEB/Retrieval benchmark)
- Strong cross-lingual retrieval (72.95% Recall@20 on MKQA En-Zh_CN)
- Flexible query formatting with optional instruction prefixing
- Enhanced performance when combined with MiniCPM-Reranker
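Optional instruction prefixing and similarity scoring can be sketched as follows. The `Instruction: ... Query: ...` template is an assumption made for illustration, as are the helper names `format_query`, `cosine`, and `rank`; the exact prompt format should be taken from the model's own documentation. The toy two-dimensional vectors stand in for real 2304-dimensional embeddings.

```python
import math
from typing import List, Optional, Sequence, Tuple


def format_query(query: str, instruction: Optional[str] = None) -> str:
    """Return the text to embed for a query.

    Assumed template: the instruction, when given, is placed before the
    query. Documents are embedded as-is, without any prefix.
    """
    if instruction:
        return f"Instruction: {instruction} Query: {query}"
    return query


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank(query_vec: Sequence[float],
         doc_vecs: Sequence[Sequence[float]]) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, best match first."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


plain = format_query("中国的首都是哪里?")
prefixed = format_query("中国的首都是哪里?",
                        instruction="Retrieve passages that answer the question.")

# Toy embeddings in place of real model output.
ranking = rank([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
best_index = ranking[0][0]
```

Because the instruction is optional, the same index of document embeddings can serve both prefixed and unprefixed queries; only the query side changes.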
Frequently Asked Questions
Q: What makes this model unique?
Its distinguishing feature is bilingual retrieval: it handles Chinese and English monolingual retrieval as well as Chinese-English cross-lingual retrieval, where it outperforms many existing models. The reported figures are 76.76 NDCG@10 on C-MTEB/Retrieval and 72.95% Recall@20 on MKQA En-Zh_CN.
Q: What are the recommended use cases?
MiniCPM-Embedding is suited to bilingual search systems, cross-lingual information retrieval, document matching, and the retrieval stage of RAG (Retrieval-Augmented Generation) systems. It is most useful where queries and documents span both Chinese and English.
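For the search and RAG use cases, a two-stage retrieve-then-rerank flow might look like the sketch below. Everything here is hypothetical scaffolding: `retrieve_then_rerank`, `word_overlap`, and the toy vectors are invented for illustration, and `rerank_score` is a stand-in for a call to a reranker such as MiniCPM-Reranker. The first stage uses a plain dot product, which equals cosine similarity only when the embeddings are L2-normalized.

```python
from typing import Callable, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve_then_rerank(query: str,
                         query_vec: Sequence[float],
                         docs: Sequence[str],
                         doc_vecs: Sequence[Sequence[float]],
                         rerank_score: Callable[[str, str], float],
                         top_k: int = 20,
                         final_k: int = 3) -> List[str]:
    """Stage 1: cheap vector scoring over every document.
    Stage 2: costlier pairwise scoring over the top_k candidates only."""
    candidates = sorted(range(len(docs)),
                        key=lambda i: dot(query_vec, doc_vecs[i]),
                        reverse=True)[:top_k]
    reranked = sorted(candidates,
                      key=lambda i: rerank_score(query, docs[i]),
                      reverse=True)
    return [docs[i] for i in reranked[:final_k]]


def word_overlap(query: str, doc: str) -> float:
    """Hypothetical stand-in for a reranker: counts shared lowercase words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


docs = [
    "Beijing is the capital of China",
    "Paris is the capital of France",
    "Pandas eat bamboo",
]
doc_vecs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]  # toy embeddings

context = retrieve_then_rerank("capital of France", [1.0, 0.0],
                               docs, doc_vecs, word_overlap,
                               top_k=2, final_k=1)
print(context)
```

In a RAG system the returned passages would then be placed into the generator's prompt as context. Keeping `top_k` larger than `final_k` gives the second stage room to correct first-stage ordering mistakes.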