MiniCPM-Embedding

Maintained By
openbmb

Property             Value
Parameter Count      2.4B
Embedding Dimension  2304
Max Input Tokens     512
License              Apache-2.0 (code), Custom MiniCPM License (weights)
Authors              ModelBest Inc., THUNLP, NEUIR

What is MiniCPM-Embedding?

MiniCPM-Embedding is a bilingual (Chinese and English) text embedding model developed jointly by ModelBest Inc., the Tsinghua University NLP Lab (THUNLP), and the Northeastern University IR Group (NEUIR). Built on MiniCPM-2B-sft-bf16, it targets both monolingual and cross-lingual text retrieval over Chinese and English content.

Implementation Details

The model combines bidirectional attention with Weighted Mean Pooling, which reduces the token-level representations to a single embedding vector. It was trained in multiple stages on approximately 6 million examples drawn from open-source, synthetic, and proprietary data. It outputs 2304-dimensional embeddings and accepts up to 512 tokens per input.

  • Bidirectional attention, so each token attends to both its left and right context
  • Weighted Mean Pooling to reduce token representations to one embedding vector
  • Multi-stage training approach with diverse data sources
  • Support for instruction-based querying
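
As a rough illustration of the Weighted Mean Pooling step described above, the sketch below pools per-token vectors with position-based weights, so later tokens count more. The weighting scheme (each unmasked token weighted by its 1-based position among unmasked tokens) and the function name weighted_mean_pooling are assumptions made for illustration. The text above does not specify the exact weights, and real usage would operate on model tensors rather than Python lists.

```python
from typing import List


def weighted_mean_pooling(hidden: List[List[float]],
                          attention_mask: List[int]) -> List[float]:
    """Pool per-token vectors into one vector.

    Assumption (not confirmed by the source): an unmasked token at
    position i (1-based, counting only unmasked tokens) gets weight i,
    and masked (padding) tokens get weight 0.
    """
    if len(hidden) != len(attention_mask):
        raise ValueError("hidden and attention_mask must have the same length")

    # weight_i = mask_i * cumulative_sum(mask)_i
    weights = []
    running = 0
    for m in attention_mask:
        running += m
        weights.append(m * running)

    total = sum(weights)
    if total == 0:
        raise ValueError("attention_mask has no unmasked tokens")

    dim = len(hidden[0])
    pooled = [0.0] * dim
    for vec, w in zip(hidden, weights):
        for j in range(dim):
            pooled[j] += w * vec[j]
    return [x / total for x in pooled]


# Toy example: three 2-dimensional token vectors, the last one is padding.
# The first two tokens get weights 1 and 2; the padded token is ignored.
example = weighted_mean_pooling(
    [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]],
    [1, 1, 0],
)
```

Compared with a plain mean, this weighting gives more influence to tokens that have seen more of the sequence, which is one reason it is sometimes paired with models derived from left-to-right language models.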

Core Capabilities

  • Chinese and English monolingual retrieval (76.76 NDCG@10 on the Chinese C-MTEB/Retrieval benchmark)
  • Cross-lingual retrieval between English and Chinese (72.95% Recall@20 on MKQA En-Zh_CN)
  • Flexible query formatting with optional instruction prefixing
  • Enhanced performance when combined with MiniCPM-Reranker
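
For readers unfamiliar with the two metrics quoted in this list, here is a minimal sketch of textbook Recall@k and NDCG@k. It is illustrative only: the function names and toy data are hypothetical, and individual benchmarks differ in details (for example graded relevance, or counting a query as a hit when any correct passage appears in the top k), so this is not the exact evaluation code behind the reported numbers.

```python
import math
from typing import Dict, List, Set


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int) -> float:
    """Discounted cumulative gain of the ranking, normalised by the ideal ranking."""
    # Gain of each returned document, discounted by log2 of its rank.
    dcg = sum(
        relevance.get(doc_id, 0.0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    # Best achievable DCG: the highest relevance labels placed first.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Toy data (hypothetical): documents "a" and "c" are relevant.
labels = {"a": 1.0, "c": 1.0}
perfect = ndcg_at_k(["a", "c", "b"], labels, k=3)   # relevant docs ranked first
shifted = ndcg_at_k(["b", "a", "c"], labels, k=3)   # relevant docs pushed down
```

NDCG rewards placing relevant documents near the top of the list, while Recall@k only asks whether they appear within the first k results at all.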

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for its bilingual coverage: it handles Chinese and English monolingual retrieval as well as Chinese-English cross-lingual retrieval, where it outperforms many existing models.

Q: What are the recommended use cases?

MiniCPM-Embedding is well suited to bilingual search systems, cross-lingual information retrieval, document matching, and the retrieval stage of RAG (Retrieval-Augmented Generation) systems. It is particularly useful for applications that need to match Chinese and English text against each other.
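
The retrieval step in such a system can be sketched as follows: format the query (optionally with an instruction prefix), embed the query and the documents, then rank the documents by cosine similarity. Everything here is a hedged illustration. The prompt template in format_query is an assumption and should be checked against the official model card, the helper names are hypothetical, and the short hand-written vectors stand in for the 2304-dimensional embeddings the model would produce.

```python
import math
from typing import List, Optional, Sequence, Tuple


def format_query(query: str, instruction: Optional[str] = None) -> str:
    """Build the query string. The template is an assumption, not a confirmed API."""
    if instruction:
        return f"Instruction: {instruction} Query: {query}"
    return f"Query: {query}"


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          doc_vecs: Sequence[Sequence[float]],
          k: int = 3) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for embeddings (real vectors would come from the model).
prompt = format_query("What causes throat cancer?",
                      instruction="Retrieve relevant answers to this medical question.")
query_vec = [1.0, 0.0]
doc_vecs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
results = top_k(query_vec, doc_vecs, k=2)
```

In a two-stage setup, the documents returned by top_k would then be passed to a reranker such as MiniCPM-Reranker before being handed to the generator.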
