MiniCPM-Embedding

openbmb

MiniCPM-Embedding is a 2.4B-parameter bilingual embedding model for Chinese and English text retrieval. It produces 2304-dimensional embeddings and supports both monolingual and cross-lingual search.

Property             Value
Parameter Count      2.4B
Embedding Dimension  2304
Max Input Tokens     512
License              Apache-2.0 (code), Custom MiniCPM License (weights)
Authors              ModelBest Inc., THUNLP, NEUIR

What is MiniCPM-Embedding?

MiniCPM-Embedding is a bilingual text embedding model developed jointly by ModelBest Inc., the Tsinghua University NLP Lab (THUNLP), and the Northeastern University IR Group (NEUIR). It is built on MiniCPM-2B-sft-bf16 and targets both monolingual and cross-lingual text retrieval over Chinese and English content.

Implementation Details

The model uses bidirectional attention over the input and reduces the per-token hidden states to a single vector with Weighted Mean Pooling. It was trained in multiple stages on approximately 6 million examples drawn from open-source, synthetic, and proprietary data. Each input can be up to 512 tokens long and yields one 2304-dimensional embedding.

  • Bidirectional attention architecture for comprehensive text understanding
  • Weighted Mean Pooling for effective representation learning
  • Multi-stage training approach with diverse data sources
  • Support for instruction-based querying
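
The pooling step named above can be illustrated with a minimal sketch. The source does not say how the weights are chosen, so this example assumes each non-padding token is weighted by its 1-based position, which makes later tokens count more. The function name and the toy inputs are hypothetical, and plain Python lists stand in for the model's hidden states.

```python
from typing import List


def weighted_mean_pooling(hidden: List[List[float]],
                          attention_mask: List[int]) -> List[float]:
    """Pool per-token vectors into a single vector.

    Assumption (not stated in the source): token i is weighted by its
    1-based position among the non-padding tokens; padding gets weight 0.
    """
    weights = []
    running = 0
    for m in attention_mask:
        running += m                 # count of real tokens seen so far
        weights.append(m * running)  # 0 for padding, position otherwise

    total = sum(weights)
    if total == 0:
        raise ValueError("attention_mask contains no real tokens")

    dim = len(hidden[0])
    return [
        sum(w * tok[j] for w, tok in zip(weights, hidden)) / total
        for j in range(dim)
    ]
```

With two real tokens and one padding token, the weights come out as 1, 2, and 0, so the second token contributes twice as much as the first and the padding vector is ignored.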

Core Capabilities

  • Chinese and English monolingual retrieval (76.76 NDCG@10 on C-MTEB/Retrieval, the Chinese benchmark)
  • Chinese-English cross-lingual retrieval (72.95% Recall@20 on MKQA En-Zh_CN)
  • Flexible query formatting with optional instruction prefixing
  • Enhanced performance when combined with MiniCPM-Reranker
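
For readers unfamiliar with the two metrics quoted in this list, the following sketch shows the common textbook definitions of Recall@k and NDCG@k. It is an illustration only: the exact evaluation protocol used for the reported C-MTEB and MKQA numbers is not described in the source and may differ (for example in how relevance is judged). Function names and document ids are hypothetical.

```python
import math
from typing import Dict, Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids: Sequence[str], relevance: Dict[str, float], k: int) -> float:
    """Normalized discounted cumulative gain with linear gain.

    `relevance` maps document id to a graded relevance score;
    ids missing from the map are treated as non-relevant (score 0).
    """
    dcg = sum(
        relevance.get(doc_id, 0.0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Both functions return a value between 0 and 1; benchmark tables typically report them multiplied by 100.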

Frequently Asked Questions

Q: What makes this model unique?

Its main distinction is bilingual coverage in a single model: it handles Chinese and English monolingual retrieval as well as Chinese-English cross-lingual retrieval, where it is reported to outperform many existing models. The headline figures are 76.76 NDCG@10 on C-MTEB/Retrieval and 72.95% Recall@20 on MKQA En-Zh_CN.

Q: What are the recommended use cases?

MiniCPM-Embedding suits bilingual search systems, cross-lingual information retrieval, document matching, and the retrieval stage of RAG (Retrieval-Augmented Generation) systems. It is most useful where queries and documents mix Chinese and English. Because inputs are capped at 512 tokens, longer documents need to be split into chunks before embedding.
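
A retrieval step in such a system might look like the sketch below. It is a sketch under stated assumptions, not the model's actual API: the `Instruction: ... Query: ...` template is a hypothetical format for the optional instruction prefix (the source only says prefixing is supported), the function names are invented for illustration, and the vectors passed to `top_k` are assumed to come from the embedding model.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def format_query(query: str, instruction: Optional[str] = None) -> str:
    """Build the query string, optionally prefixed by an instruction.

    The template is an assumption for illustration; consult the official
    model card for the exact format the model was trained with.
    """
    if instruction:
        return f"Instruction: {instruction} Query: {query}"
    return f"Query: {query}"


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: Sequence[float],
          doc_vecs: Dict[str, List[float]],
          k: int = 3) -> List[Tuple[str, float]]:
    """Rank documents by cosine similarity to the query, best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

In a full pipeline, the candidates returned by `top_k` could then be passed to a reranker such as MiniCPM-Reranker before being handed to the generator.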
