nomic-embed-text-v1

Property	Value
Parameter Count	137M
Context Length	8192 tokens
License	Apache 2.0
Paper	arXiv:2402.01613

What is nomic-embed-text-v1?

nomic-embed-text-v1 is an open-source text embedding model reported to outperform OpenAI's text-embedding-ada-002 and text-embedding-3-small on both short- and long-context tasks. It scores 62.39 on MTEB and 85.53 on the LoCo long-context benchmark.

Implementation Details

The model is trained in stages, starting from a long-context BERT model. The first stage is unsupervised contrastive learning on diverse text pairs from sources such as StackExchange and Quora; the second is fine-tuning on high-quality labeled datasets. At inference time, each input needs a task instruction prefix for best results.
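To make the contrastive stage concrete, here is a minimal sketch of an InfoNCE-style loss with in-batch negatives, a common formulation for training on text pairs. The function name `info_nce_loss`, the temperature of 0.05, and the use of cosine similarity are assumptions chosen for illustration, not details taken from this page.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce_loss(queries, documents, temperature=0.05):
    """Mean InfoNCE loss over a batch of (query, document) pairs.

    queries[i] and documents[i] form a positive pair; every other
    document in the batch acts as a negative for queries[i].
    The temperature default is an illustrative assumption.
    """
    q = [normalize(v) for v in queries]
    d = [normalize(v) for v in documents]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarities scaled by the temperature
        logits = [dot(qi, dj) / temperature for dj in d]
        # Log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the matching document
        total += log_z - logits[i]
    return total / len(q)
```

The loss is close to zero when each query is most similar to its own document and grows when the pairs are mismatched, which is what pushes paired texts together in the embedding space.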

Supports an 8192-token context length with native scaling
Implements mean pooling for embedding generation
Available through multiple frameworks including Sentence Transformers and Transformers.js
Extended to multimodal use through nomic-embed-vision-v1, whose vision embeddings are aligned with this model's text embeddings
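Mean pooling, mentioned in the list above, can be sketched in a few lines. This is a plain-Python illustration, assuming per-token vectors and a 0/1 attention mask as inputs; the function name `mean_pool` is hypothetical and real pipelines would normally do this on tensors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                sums[k] += vec[k]
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [s / count for s in sums]

# Two real tokens and one padding token that is masked out
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

Masking matters because padded positions would otherwise drag the average toward whatever vector the padding token produces.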

Core Capabilities

Document embedding for RAG applications
Query embedding for search tasks
Text clustering and semantic duplicate detection
Classification task embeddings
Cross-modal alignment with vision embeddings
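Most of these capabilities come down to comparing embedding vectors. The sketch below shows cosine-similarity ranking for search and a simple threshold check for semantic duplicates. The function names and the 0.95 threshold are assumptions for illustration, and the short vectors stand in for real model output.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

def near_duplicates(vectors, threshold=0.95):
    """Return index pairs whose cosine similarity meets the threshold."""
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_similarity(vectors[i], vectors[j]) >= threshold:
                pairs.append((i, j))
    return pairs

docs = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
print(rank_documents([1.0, 0.0], docs))  # [1, 2, 0]
```

The pairwise loop in `near_duplicates` is quadratic in the number of vectors, so a large collection would normally use an approximate nearest-neighbor index instead.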

Frequently Asked Questions

Q: What makes this model unique?

The model pairs an open license (Apache 2.0) and an 8192-token context length with reported benchmark scores above those of the proprietary text-embedding-ada-002 and text-embedding-3-small.

Q: What are the recommended use cases?

The model is suited to RAG applications, semantic search, document clustering, and classification tasks. Each input needs the task prefix that matches its role (search_document, search_query, clustering, or classification) for best results.
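A small helper can keep prefixes consistent across a codebase. The sketch below assumes the prefix is joined to the text as "<task>: <text>"; that exact layout and the helper name `add_task_prefix` are assumptions here, so check the model card for the precise format before relying on it.

```python
# The four task prefixes named above
TASK_PREFIXES = ("search_document", "search_query", "clustering", "classification")

def add_task_prefix(text, task):
    """Prepend a task prefix in the assumed "<task>: <text>" layout."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task prefix: {task!r}")
    return f"{task}: {text}"

# Documents and queries get different prefixes in a retrieval setup
print(add_task_prefix("TSNE is a dimensionality reduction algorithm", "search_document"))
print(add_task_prefix("What is TSNE?", "search_query"))
```

The prefixed strings are what would be passed to the embedding model. Rejecting unknown task names early avoids silently embedding text with a prefix the model was never trained on.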