Snowflake Arctic-Embed-M-Long
| Property | Value |
|---|---|
| Parameters | 137M |
| Embedding Dimension | 768 |
| Max Context Length | 8192 tokens (with RoPE) |
| License | Apache-2.0 |
| Paper | Technical Report |
What is snowflake-arctic-embed-m-long?
Snowflake-arctic-embed-m-long is a text embedding model designed specifically for long-context retrieval tasks. Built on the nomic-ai/nomic-embed-text-v1-unsupervised architecture, it achieves an MTEB Retrieval score (NDCG@10) of 54.83, outperforming embedding models of comparable size.
Implementation Details
The model is trained with a multi-stage pipeline: large-batch pretraining on 400M samples followed by fine-tuning on 1M carefully curated triplets. It uses Rotary Position Embedding (RoPE) to handle sequences of up to 8192 tokens, making it well suited to long-document processing.
- 768-dimensional embeddings
- Support for both the standard context (2048 tokens) and an extended context (8192 tokens via RoPE scaling)
- Optimized for retrieval, with asymmetric query/document encoding (queries carry a task prefix); see the loading sketch below
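A minimal loading sketch with Hugging Face transformers follows the public model card; treat the trust_remote_code, add_pooling_layer, and rotary_scaling_factor arguments as assumptions to verify against your installed transformers version.

```python
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "Snowflake/snowflake-arctic-embed-m-long"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Standard context (up to 2048 tokens)
model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,   # the repo ships custom code implementing RoPE
    add_pooling_layer=False,  # we pool manually from the hidden states
)
model.eval()

# Extended context (up to 8192 tokens): the model card suggests doubling the
# RoPE scaling factor. Assumption: kwarg name per the card's remote code.
long_model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    add_pooling_layer=False,
    rotary_scaling_factor=2,
)
long_model.eval()
```

The scaled variant exists for inputs beyond 2048 tokens; when documents fit in the standard window, loading the default configuration is the simpler choice.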
Core Capabilities
- High-quality text embeddings for retrieval and similarity tasks (see the retrieval sketch after this list)
- Extended context length support via RoPE scaling
- Efficient processing of both short and long documents
- Strong retrieval performance on the MTEB benchmark (54.83 NDCG@10)
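To make the retrieval behavior concrete, here is an end-to-end scoring sketch. It assumes the CLS-token pooling and the "Represent this sentence for searching relevant passages: " query prefix documented for the Arctic-Embed family; the query and document strings are hypothetical.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "Snowflake/snowflake-arctic-embed-m-long"
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True,
                                  add_pooling_layer=False)
model.eval()

queries = ["what is snowflake?"]                     # hypothetical inputs
documents = ["Snowflake is a cloud data platform.",
             "Tacos are a Mexican dish."]

# Queries get the task prefix; documents are encoded as-is
query_tokens = tokenizer([QUERY_PREFIX + q for q in queries],
                         padding=True, truncation=True,
                         max_length=512, return_tensors="pt")
doc_tokens = tokenizer(documents, padding=True, truncation=True,
                       max_length=512, return_tensors="pt")

with torch.no_grad():
    # CLS-token (first position) pooling, then L2 normalization
    query_emb = F.normalize(model(**query_tokens)[0][:, 0], p=2, dim=1)
    doc_emb = F.normalize(model(**doc_tokens)[0][:, 0], p=2, dim=1)

# Cosine similarity reduces to a dot product on normalized vectors
scores = query_emb @ doc_emb.T
print(scores)  # shape: (num_queries, num_documents)
```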
Frequently Asked Questions
Q: What makes this model unique?
The model combines extended context length support (up to 8192 tokens) with state-of-the-art retrieval performance, making it ideal for applications requiring both accuracy and long document processing.
Q: What are the recommended use cases?
The model excels in document retrieval, semantic search, and similarity matching tasks, particularly where long document context is important. It's especially suitable for enterprise applications requiring high-quality text embeddings.
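For semantic-search pipelines, a sentence-transformers wrapper can be more convenient. This is a sketch, assuming a sentence-transformers version that supports trust_remote_code; the queries and documents are hypothetical, and the same query prefix is applied manually.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-long",
                            trust_remote_code=True)

QUERY_PREFIX = "Represent this sentence for searching relevant passages: "
queries = ["how do I rotate my access keys?"]        # hypothetical inputs
docs = ["Rotate keys from the security settings page.",
        "Our cafeteria menu changes weekly."]

q_emb = model.encode([QUERY_PREFIX + q for q in queries],
                     normalize_embeddings=True)
d_emb = model.encode(docs, normalize_embeddings=True)

# Rank documents by cosine similarity, highest first, for each query
scores = q_emb @ d_emb.T
ranking = scores.argsort(axis=1)[:, ::-1]
print(ranking)
```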