CDE-Small-V1
| Property | Value |
|---|---|
| Parameter Count | 281M |
| Model Type | Contextual Document Embeddings |
| Paper | arXiv paper |
| MTEB Score | 65.00 (best among models under 400M parameters) |
What is cde-small-v1?
CDE-small-v1 is a text embedding model that introduces a two-stage approach to document embedding. It integrates "context tokens" drawn from the surrounding corpus into the embedding process, achieving state-of-the-art performance on the MTEB leaderboard among models under 400M parameters.
Implementation Details
The model operates in two distinct stages. First, it gathers dataset-level information by embedding a subset of the corpus with a first-stage model. Second, it embeds queries and documents while conditioning on the corpus information from the first stage. This approach lets the model stay aware of the corpus context while generating embeddings.
- Two-stage architecture for context-aware embeddings
- Compatible with both Transformers and Sentence-Transformers libraries
- Supports task-specific prefixes for optimal performance
- Uses a fixed first-stage context size of 512 documents for best performance
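As an illustration only (not the model's actual architecture), the two-stage flow described above can be sketched with toy vectors; the `first_stage_embed` and `second_stage_embed` functions here are hypothetical stand-ins for the real encoders:

```python
import numpy as np

rng = np.random.default_rng(0)

def first_stage_embed(texts):
    # Hypothetical stand-in for the first-stage encoder:
    # maps each corpus document to a fixed-size vector.
    return rng.normal(size=(len(texts), 8))

def second_stage_embed(text, dataset_embeddings):
    # Hypothetical stand-in for the second-stage encoder:
    # conditions the text's embedding on pooled corpus context.
    text_vec = rng.normal(size=8)
    context_vec = dataset_embeddings.mean(axis=0)
    return text_vec + context_vec  # toy "conditioning"

corpus = [f"document {i}" for i in range(512)]      # 512 context documents
dataset_embeddings = first_stage_embed(corpus)      # stage 1: corpus info
query_embedding = second_stage_embed("my query", dataset_embeddings)  # stage 2
print(query_embedding.shape)  # (8,)
```

The point of the sketch is the data flow: stage-one outputs are computed once per corpus and reused as conditioning input for every stage-two query or document embedding.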
Core Capabilities
- State-of-the-art performance on MTEB benchmark
- Efficient document and query embedding generation
- Robust performance even without specific corpus information
- Specialized handling of retrieval tasks through prefix prompting
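For retrieval, queries and documents are distinguished with task-specific prefixes. A minimal sketch, assuming the `search_query: ` / `search_document: ` prefix convention (verify the exact strings against the model card before use):

```python
QUERY_PREFIX = "search_query: "
DOCUMENT_PREFIX = "search_document: "

def with_prefix(text: str, is_query: bool) -> str:
    # Prepend the task-specific prefix before encoding, so the model
    # can treat queries and documents asymmetrically.
    prefix = QUERY_PREFIX if is_query else DOCUMENT_PREFIX
    return prefix + text

query = with_prefix("what is a contextual embedding?", is_query=True)
doc = with_prefix("Contextual embeddings condition on corpus context.", is_query=False)
print(query)  # search_query: what is a contextual embedding?
```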
Frequently Asked Questions
Q: What makes this model unique?
The model's two-stage approach and context-aware embedding generation set it apart, allowing it to achieve superior performance with a relatively small parameter count of 281M.
Q: What are the recommended use cases?
The model excels in document retrieval, semantic search, and text similarity tasks. It's particularly effective when you can provide corpus-specific context through the first-stage embedding process.
Q: How does it handle unknown corpora?
While the model performs best with corpus-specific context, it can still function effectively using random strings as context, with only a minor performance drop on MTEB (63.8 vs. 65.0).
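When no real corpus is available, the first stage can be fed random strings as a surrogate context. A minimal sketch of building such a context; the word counts and lengths here are illustrative choices, not the model's defaults:

```python
import random
import string

random.seed(0)

def random_context_documents(n: int = 512, words: int = 16, word_len: int = 6):
    # Build n pseudo-documents of random lowercase "words" to serve as
    # surrogate first-stage context when no real corpus exists.
    def random_word():
        return "".join(random.choices(string.ascii_lowercase, k=word_len))
    return [" ".join(random_word() for _ in range(words)) for _ in range(n)]

context_docs = random_context_documents()
print(len(context_docs))  # 512
```

These surrogate documents would then be embedded by the first-stage model exactly as a real corpus would be.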