cde-small-v1

jxm

State-of-the-art small embedding model (281M parameters) achieving a 65.0 MTEB score through an innovative contextual document embedding approach

  • Parameter Count: 281M parameters
  • Model Type: Contextual Document Embeddings
  • Paper: ArXiv Paper
  • MTEB Score: 65.00 (best for models under 400M parameters)

What is cde-small-v1?

CDE-small-v1 is a groundbreaking text embedding model that introduces a novel two-stage approach to document embedding. It naturally integrates "context tokens" into the embedding process, achieving state-of-the-art performance on the MTEB leaderboard for models under 400M parameters.

Implementation Details

The model operates in two distinct stages: First, it gathers dataset information by embedding a subset of the corpus using a first-stage model. Second, it embeds queries and documents while conditioning on the corpus information from the first stage. This innovative approach allows the model to maintain context awareness while generating embeddings.

  • Two-stage architecture for context-aware embeddings
  • Compatible with both Transformers and Sentence-Transformers libraries
  • Supports task-specific prefixes for optimal performance
  • Conditions on a fixed set of exactly 512 context documents
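The two-stage flow described above can be sketched abstractly. Note this is a conceptual illustration, not the model's real API: `embed_stage1` and `embed_stage2` are hypothetical stand-ins for the two encoders, with random vectors in place of learned embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_stage1(texts):
    # Stage 1 (hypothetical): embed each corpus document independently.
    return rng.normal(size=(len(texts), 64))

def embed_stage2(texts, dataset_embeddings):
    # Stage 2 (hypothetical): embed texts while conditioning on the
    # corpus information gathered in stage 1 (shown here as a simple
    # additive shift by the pooled corpus embedding).
    context = dataset_embeddings.mean(axis=0)
    base = rng.normal(size=(len(texts), 64))
    return base + context

# Stage 1: embed the fixed subset of 512 corpus documents.
minicorpus = [f"doc {i}" for i in range(512)]
dataset_embeddings = embed_stage1(minicorpus)

# Stage 2: embed documents and queries conditioned on that corpus context.
doc_vecs = embed_stage2(["some document"], dataset_embeddings)
query_vecs = embed_stage2(["some query"], dataset_embeddings)
print(doc_vecs.shape, query_vecs.shape)  # (1, 64) (1, 64)
```

In the real model, the stage-1 outputs are passed to the second-stage encoder so that every document and query embedding is computed relative to the same corpus context.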

Core Capabilities

  • State-of-the-art performance on MTEB benchmark
  • Efficient document and query embedding generation
  • Robust performance even without specific corpus information
  • Specialized handling of retrieval tasks through prefix prompting
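Prefix prompting for retrieval amounts to prepending a task marker to each input before encoding. A minimal sketch follows; the exact prefix strings are an assumption modeled on common embedding-model conventions, so verify them against the model card before use.

```python
# Assumed prefix strings -- check the model card for the exact values.
QUERY_PREFIX = "search_query: "
DOCUMENT_PREFIX = "search_document: "

def with_prefix(texts, prefix):
    # Prepend the task-specific prefix to every input string.
    return [prefix + t for t in texts]

queries = with_prefix(["how do contextual embeddings work?"], QUERY_PREFIX)
docs = with_prefix(["CDE embeds documents in two stages."], DOCUMENT_PREFIX)
print(queries[0])
```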

Frequently Asked Questions

Q: What makes this model unique?

The model's two-stage approach and context-aware embedding generation set it apart, allowing it to achieve superior performance with a relatively small parameter count of 281M.

Q: What are the recommended use cases?

The model excels in document retrieval, semantic search, and text similarity tasks. It's particularly effective when you can provide corpus-specific context through the first-stage embedding process.
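Once query and document embeddings are produced, retrieval and semantic search typically reduce to ranking documents by cosine similarity. A generic sketch with placeholder vectors (not embeddings from the model):

```python
import numpy as np

def cosine_rank(query_vec, doc_vecs):
    # Normalize, score by dot product, and return indices best-first.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores), scores

doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query_vec = np.array([1.0, 0.1])
order, scores = cosine_rank(query_vec, doc_vecs)
print(order)  # indices of documents, most similar first
```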

Q: How does it handle unknown corpora?

While the model performs best with corpus-specific context, it can still function effectively using embeddings of random strings as context, at the cost of only a minor performance drop (65.0 to 63.8 on MTEB).
