# cde-small-v2
| Property | Value |
|---|---|
| Parameter Count | 140M (effective) |
| MTEB Score | 65.58 |
| Paper | Contextual Document Embeddings |
| Author | jxm |
## What is cde-small-v2?
cde-small-v2 is a text embedding model built around a two-stage architecture that produces context-aware document embeddings. As of January 2025, it ranks as the best small model (under 400M parameters) on the MTEB leaderboard for text embedding models.
## Implementation Details
The model uses a two-stage architecture: the first stage embeds a small sample of the target corpus to produce "dataset embeddings" that capture domain-level context, and the second stage embeds queries and documents conditioned on those dataset embeddings (see the usage sketch after the list below). Conditioning on corpus context improves embedding quality over single-stage encoders.
- Uses ModernBERT as the base architecture
- Implements residual connections between model stages
- Features optimized pooling and position-embedding strategies
- Trained on the nomic-unsupervised dataset and fine-tuned on the BGE dataset
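
A minimal sketch of this two-stage flow through the Sentence Transformers integration, following the usage pattern published for the cde model family. The `prompt_name` and `dataset_embeddings` arguments and the `transductive_corpus_size` config field come from the author's examples and may change between releases, so treat the exact names as assumptions; the corpus, document, and query strings here are placeholders.

```python
from sentence_transformers import SentenceTransformer

# trust_remote_code is required: the two-stage logic ships with the
# model repository rather than the sentence-transformers library.
model = SentenceTransformer("jxm/cde-small-v2", trust_remote_code=True)

# Stage 1: embed a sample of your corpus to build "dataset embeddings".
# The sample should contain exactly `transductive_corpus_size` documents;
# repeating entries to pad a small corpus follows the published examples
# (exact config attribute path is an assumption).
corpus = [
    "Contextual document embeddings condition on a corpus sample.",
    "ModernBERT serves as the backbone encoder.",
    "Dataset embeddings are computed once and reused for every query.",
]
minicorpus_size = model[0].config.transductive_corpus_size
minicorpus = (corpus * minicorpus_size)[:minicorpus_size]
dataset_embeddings = model.encode(
    minicorpus, prompt_name="document", convert_to_tensor=True
)

# Stage 2: embed documents and queries, conditioned on stage 1's output.
doc_embeddings = model.encode(
    corpus,
    prompt_name="document",
    dataset_embeddings=dataset_embeddings,
    convert_to_tensor=True,
)
query_embeddings = model.encode(
    ["which encoder does cde use?"],
    prompt_name="query",
    dataset_embeddings=dataset_embeddings,
    convert_to_tensor=True,
)
```

Because the dataset embeddings depend only on the corpus sample, they can be computed once and cached, which is why the model performs best when corpus information is available ahead of time.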
## Core Capabilities
- High-quality document and query embeddings
- Context-aware embedding generation
- Efficient two-stage processing
- Support for both Transformers and Sentence Transformers implementations
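
Once query and document embeddings exist, retrieval reduces to cosine similarity. A self-contained toy example, with random tensors standing in for the embeddings from the earlier sketch (the 768 embedding width is an assumption):

```python
import torch

# Stand-ins for query/document embeddings from the earlier sketch:
# 4 queries and 10 documents, 768-dimensional (width assumed).
query_embeddings = torch.randn(4, 768)
doc_embeddings = torch.randn(10, 768)

# Cosine similarity = dot product of L2-normalized vectors.
q = torch.nn.functional.normalize(query_embeddings, dim=-1)
d = torch.nn.functional.normalize(doc_embeddings, dim=-1)
scores = q @ d.T                  # shape: (num_queries, num_docs)
best_doc = scores.argmax(dim=-1)  # index of top document per query
```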
## Frequently Asked Questions
Q: What makes this model unique?
A: The model's distinctive feature is its two-stage architecture, which naturally integrates context tokens into the embedding process, allowing for more nuanced and context-aware embeddings while maintaining a relatively small parameter count.
Q: What are the recommended use cases?
A: The model is particularly well-suited for document retrieval tasks, semantic search applications, and any use case requiring high-quality text embeddings with context awareness. It performs especially well when corpus information is available ahead of time.