dewey_en_beta
Property | Value |
---|---|
Parameter Count | 395M |
Model Type | Embedding Model |
Max Context Length | 128k tokens |
Embedding Dimension | 2048 |
Model URL | https://huggingface.co/infgrad/dewey_en_beta |
What is dewey_en_beta?
dewey_en_beta is an English embedding model developed by infgrad in collaboration with Richinfo. It is built on the answerdotai/ModernBERT-large architecture and is aimed particularly at long-form content. The model supports both single-vector and multi-vector embeddings; the multi-vector mode follows a ColBERT-like approach but uses significantly fewer vectors.
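To make the "ColBERT-like" wording concrete, here is a minimal sketch of late-interaction (MaxSim) scoring over a small set of vectors. It assumes the standard ColBERT scoring rule; whether dewey_en_beta scores its multi-vector output exactly this way is not stated here. The 3-dimensional vectors are toy values, not model output, and the function names are illustrative.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: for each query vector, take its best
    match among the document vectors, then sum those maxima."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)

# Toy 3-dimensional vectors standing in for real 2048-dimensional embeddings.
query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
doc_a = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]  # one vector aligned with each query vector
doc_b = [[0.0, 0.0, 1.0]]                   # orthogonal to both query vectors

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

With fewer vectors per document, the inner `max` runs over a shorter list, which is where the cost saving over token-level scoring would come from.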
Implementation Details
The model uses a novel training approach and reports strong results across several benchmarks. Its multi-vector combination method is flexible: each vector can represent a span or chunk rather than a single token, so chunking can be customized for a specific use case.
- 395M parameters with 2048-dimensional embeddings
- 128k token context window
- Support for both single and multi-vector embeddings
- Fast encoding speed thanks to the ModernBERT architecture
- State-of-the-art performance on LongEmbed benchmark (0.86 vs previous SOTA of 0.79)
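The arithmetic behind chunk-level vectors can be sketched as follows. The fixed 512-token chunk size is a hypothetical choice, "128k" is treated as 128,000 tokens, and both layouts are assumed to store float32 vectors at the 2048 dimensions listed above (token-level systems often use smaller vectors in practice). The helper names are illustrative and not part of any library.

```python
EMBEDDING_DIM = 2048    # embedding dimension listed for the model
BYTES_PER_FLOAT32 = 4   # assumption: vectors stored as float32

def chunk_spans(num_tokens, chunk_size):
    """Return (start, end) token offsets for consecutive fixed-size chunks.
    `end` is exclusive; the last chunk may be shorter."""
    return [(start, min(start + chunk_size, num_tokens))
            for start in range(0, num_tokens, chunk_size)]

def index_size_bytes(num_vectors, dim=EMBEDDING_DIM):
    """Raw storage needed for `num_vectors` float32 vectors of size `dim`."""
    return num_vectors * dim * BYTES_PER_FLOAT32

doc_tokens = 128_000                      # a document filling the context window
spans = chunk_spans(doc_tokens, 512)      # hypothetical 512-token chunks

token_level_bytes = index_size_bytes(doc_tokens)   # one vector per token
chunk_level_bytes = index_size_bytes(len(spans))   # one vector per chunk

print(len(spans), token_level_bytes, chunk_level_bytes)
```

Under these assumptions the document yields 250 chunk vectors (about 2 MB) instead of 128,000 token vectors (about 1 GB), a 512-fold reduction that follows directly from the chunk size.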
Core Capabilities
- Long-form text embedding with superior performance
- Flexible chunk-based multi-vector representations
- Competitive performance on MTEB benchmark
- Instruction-tuned embedding generation
- Efficient processing of documents up to 128k tokens
Frequently Asked Questions
Q: What makes this model unique?
The model's ability to handle both single and multi-vector embeddings, combined with its extraordinary context length of 128k tokens and state-of-the-art performance on long-text tasks, sets it apart from other embedding models.
Q: What are the recommended use cases?
The model excels in long-document retrieval, semantic search, and document similarity tasks. It's particularly well-suited for applications requiring processing of long documents such as legal texts, academic papers, or technical documentation.
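For the single-vector retrieval case, ranking reduces to comparing one query vector against one vector per document. The sketch below uses cosine similarity on made-up 3-dimensional vectors; in practice these would be 2048-dimensional embeddings produced by the model. The document ids and function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings; real ones would come from the embedding model.
corpus = {
    "contract": [0.9, 0.1, 0.0],
    "paper": [0.1, 0.9, 0.2],
    "manual": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

ranking = rank(query_vec, corpus)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

A brute-force scan like this is fine for small collections; larger ones would typically use an approximate nearest-neighbor index instead.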