MiniCPM-Embedding

Property	Value
Parameter Count	2.4B
Embedding Dimension	2304
Max Input Tokens	512
License	Apache-2.0 (code), Custom MiniCPM License (weights)
Authors	ModelBest Inc., THUNLP, NEUIR

What is MiniCPM-Embedding?

MiniCPM-Embedding is a bilingual (Chinese and English) text embedding model developed jointly by ModelBest Inc., the Tsinghua University NLP Lab (THUNLP), and the Northeastern University IR Group (NEUIR). It is built on MiniCPM-2B-sft-bf16 and targets text retrieval, both monolingual (Chinese or English) and cross-lingual (between Chinese and English).

Implementation Details

The model uses bidirectional attention and pools the token representations into a single vector with Weighted Mean Pooling. It was trained in multiple stages on approximately 6 million examples drawn from open-source, synthetic, and proprietary data. Each input can be up to 512 tokens long and is mapped to a 2304-dimensional embedding.
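To make the pooling step concrete, here is a minimal sketch in plain Python. The text above does not spell out the weighting scheme, so this sketch assumes position weighting: each non-padding token is weighted by its 1-based position among the real tokens, and padding gets weight zero. The function names `weighted_mean_pool` and `l2_normalize` are illustrative and not part of any released API.

```python
import math
from typing import List, Sequence


def weighted_mean_pool(hidden: Sequence[Sequence[float]],
                       mask: Sequence[int]) -> List[float]:
    """Pool per-token vectors into one vector for a single sequence.

    Assumption: a real token's weight is its 1-based position among the
    non-padding tokens (1, 2, 3, ...); padding tokens (mask == 0) are skipped.
    """
    if len(hidden) != len(mask):
        raise ValueError("hidden and mask must have the same length")
    if not hidden:
        raise ValueError("at least one token is required")

    dim = len(hidden[0])
    pooled = [0.0] * dim
    total_weight = 0.0
    position = 0
    for vec, is_real in zip(hidden, mask):
        if not is_real:
            continue  # padding contributes nothing
        position += 1
        total_weight += position
        for j in range(dim):
            pooled[j] += position * vec[j]

    if total_weight == 0:
        raise ValueError("mask contains no real tokens")
    return [value / total_weight for value in pooled]


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


# Toy input: two real tokens followed by one padding token.
token_states = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]
# Weights are 1 and 2, so the pooled vector is [1/3, 2/3].
embedding = l2_normalize(weighted_mean_pool(token_states, attention_mask))
```

In a real pipeline the per-token vectors would be the model's final hidden states (2304 values per token), and the same computation would normally be batched with a tensor library.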

Bidirectional attention architecture for comprehensive text understanding
Weighted Mean Pooling for effective representation learning
Multi-stage training approach with diverse data sources
Support for instruction-based querying
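Instruction-based querying can be sketched as a small formatting helper applied to the query text before it is encoded. The exact template strings below ("Instruction: ... Query: ..." and "Query: ...") are an assumption based on the upstream model card and should be checked against it; the helper name `format_query` is hypothetical. The sketch also assumes that documents are encoded as-is, without any prefix.

```python
from typing import Optional


def format_query(query: str, instruction: Optional[str] = None) -> str:
    """Build the text that is fed to the embedding model for a query.

    Assumed templates (verify against the official model card):
      with instruction:    "Instruction: <instruction> Query: <query>"
      without instruction: "Query: <query>"
    """
    query = query.strip()
    if instruction and instruction.strip():
        return f"Instruction: {instruction.strip()} Query: {query}"
    return f"Query: {query}"


# Instruction-free mode.
plain = format_query("中国的首都是哪里？")

# Instruction mode, steering the embedding toward a specific task.
guided = format_query(
    "What is the capital of China?",
    instruction="Given a question, retrieve passages that answer it.",
)
```

Only the query side is formatted in this sketch; passages in the index would be embedded from their raw text.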

Core Capabilities

Chinese and English monolingual retrieval (76.76 NDCG@10 on the Chinese C-MTEB/Retrieval benchmark)
Chinese-English cross-lingual retrieval (72.95% Recall@20 on MKQA En-Zh_CN)
Flexible query formatting with optional instruction prefixing
Improved retrieval quality when results are reranked with MiniCPM-Reranker
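For readers unfamiliar with the two metrics quoted above, the following sketch shows how NDCG@10 and Recall@20 are commonly computed for a single query. It assumes binary relevance (a document is either relevant or not); benchmark toolkits may use graded relevance or a task-specific notion of a hit, so this is an illustration of the general formulas and not a reproduction of the reported numbers. The function names are illustrative.

```python
import math
from typing import Sequence, Set


def ndcg_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """NDCG@k with binary relevance: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    # The ideal ranking places all relevant documents at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int = 20) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


# One query whose only relevant document is returned at rank 2.
ranking = ["d3", "d1", "d7"]
gold = {"d1"}
ndcg = ndcg_at_k(ranking, gold)      # 1 / log2(3), roughly 0.63
recall = recall_at_k(ranking, gold)  # 1.0, the relevant document is in the top 20
```

Benchmark scores are averages of such per-query values over the whole test set.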

Frequently Asked Questions

Q: What makes this model unique?

Its main distinction is bilingual coverage: the same model serves Chinese retrieval, English retrieval, and Chinese-English cross-lingual retrieval. On the reported benchmarks it scores 76.76 NDCG@10 on C-MTEB/Retrieval and 72.95% Recall@20 on MKQA En-Zh_CN, and it is reported to outperform many existing models in Chinese-English cross-lingual scenarios.

Q: What are the recommended use cases?

MiniCPM-Embedding is well suited to bilingual search systems, cross-lingual information retrieval, document matching, and the retrieval stage of RAG (Retrieval-Augmented Generation) systems. It is most useful in applications that need to match Chinese and English text against each other.
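The retrieval step in such systems usually reduces to ranking document embeddings by similarity to the query embedding. The sketch below shows that step with cosine similarity in plain Python. The tiny 2-dimensional vectors stand in for real 2304-dimensional embeddings, and the helper names `cosine_similarity` and `top_k` are illustrative, not part of any library.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          doc_vecs: Sequence[Sequence[float]],
          k: int = 3) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for embeddings of one query and three documents.
query_embedding = [1.0, 0.0]
document_embeddings = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
results = top_k(query_embedding, document_embeddings, k=2)
# Document 1 points in the same direction as the query, so it ranks first,
# followed by document 2.
```

In a two-stage setup, the candidates returned here could then be passed to a reranker such as MiniCPM-Reranker, and the final top passages handed to the generator in a RAG pipeline. For large collections an approximate nearest-neighbor index would replace the linear scan.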