# nomic-embed-text-v1

| Property | Value |
|---|---|
| Parameter Count | 137M |
| Context Length | 8192 tokens |
| License | Apache 2.0 |
| Paper | arXiv:2402.01613 |
## What is nomic-embed-text-v1?
nomic-embed-text-v1 is a state-of-the-art text embedding model that outperforms OpenAI's text-embedding-ada-002 and text-embedding-3-small on both short- and long-context tasks. With an MTEB score of 62.39 and a LoCo score of 85.53, it represents a significant advance in open-source embedding technology.
## Implementation Details
The model is trained through a multi-stage pipeline, starting from a long-context BERT model. It first applies unsupervised contrastive learning to diverse text pairs from sources such as StackExchange and Quora, then fine-tunes on high-quality labeled datasets. For optimal performance, each input must be prepended with a task instruction prefix (see the usage sketch after the list below).
- Supports an 8192-token context length by scaling its rotary position embeddings at inference
- Generates embeddings via mean pooling over token representations
- Available through multiple frameworks, including Sentence Transformers and Transformers.js
- Recently expanded to multimodal use through nomic-embed-vision-v1
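As a minimal usage sketch of the prefix convention, the snippet below embeds documents and a query through Sentence Transformers. The model id and `trust_remote_code=True` follow the model's Hugging Face card; the sample strings are placeholders:

```python
from sentence_transformers import SentenceTransformer

# The repository ships custom modeling code, so trust_remote_code is required.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# Every input carries a task prefix; documents and queries use different ones.
doc_embeddings = model.encode([
    "search_document: Nomic Embed is a long-context text embedding model.",
    "search_document: Mean pooling averages token embeddings into one vector.",
])
query_embedding = model.encode("search_query: What is Nomic Embed?")

print(doc_embeddings.shape)  # (2, 768) -- one 768-dimensional vector per document
```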
## Core Capabilities
- Document embedding for RAG applications
- Query embedding for semantic search (a retrieval sketch follows this list)
- Text clustering and semantic duplicate detection
- Classification task embeddings
- Cross-modal alignment with vision embeddings
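To make the retrieval workflow concrete, here is a hedged sketch that ranks documents against a query by cosine similarity; the corpus and query strings are invented for illustration:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# Hypothetical corpus; note the search_document prefix on every entry.
documents = [
    "search_document: The Transformer architecture relies on self-attention.",
    "search_document: Contrastive learning pulls paired texts together.",
    "search_document: Apache 2.0 is a permissive open-source license.",
]
doc_embeddings = model.encode(documents, convert_to_tensor=True)

# Queries take the search_query prefix instead.
query_embedding = model.encode(
    "search_query: How does contrastive training work?", convert_to_tensor=True
)

# Rank documents by cosine similarity and report the best match.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = int(scores.argmax())
print(documents[best], float(scores[best]))
```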
## Frequently Asked Questions
**Q: What makes this model unique?**
The model combines open-source accessibility with state-of-the-art performance, supporting an 8192-token context length while posting benchmark scores superior to proprietary alternatives.
**Q: What are the recommended use cases?**
The model excels in RAG applications, semantic search, document clustering, and classification tasks. For best results, prepend the matching task prefix to each input: `search_document:` for corpus text, `search_query:` for queries, `clustering:` for clustering, and `classification:` for classification (a raw-Transformers sketch with mean pooling follows).
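For readers using raw Transformers rather than Sentence Transformers, the sketch below makes the mean pooling step from the implementation notes explicit. It assumes the common pattern for this model (BERT tokenizer, L2-normalized outputs); the sample sentence is a placeholder:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padding positions via the attention mask.
    token_embeddings = model_output[0]
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
model.eval()

sentences = ["search_query: what is retrieval augmented generation?"]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    output = model(**encoded)

embeddings = mean_pooling(output, encoded["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)  # unit-length vectors for cosine similarity
print(embeddings.shape)  # torch.Size([1, 768])
```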