ember-v1
| Property | Value |
|---|---|
| Parameter Count | 335M |
| Architecture | BERT-based with RetroMAE and SetFit enhancements |
| Dimensions | 1024 |
| Max Sequence Length | 512 tokens |
| License | MIT |
What is ember-v1?
ember-v1 is a text embedding model that achieved state-of-the-art performance on the Massive Text Embedding Benchmark (MTEB) at the time of its release, with an average score of 63.54 across 56 tasks. The model combines techniques from the RetroMAE and SetFit lines of research to produce high-quality embeddings for applications such as similarity search, clustering, and classification.
Implementation Details
The model generates 1024-dimensional embeddings and can handle sequences up to 512 tokens in length. It has been trained on a diverse corpus spanning finance, science, medicine, law, and other domains, making it versatile for different applications.
- Outperforms comparable models such as bge-large-en-v1.5 and OpenAI's text-embedding-ada-002 on the MTEB average
- Uses average (mean) pooling over token embeddings to produce a single vector per text
- Works with both the transformers and sentence-transformers libraries (see the sketch below)
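The snippet below is a minimal sketch of the average-pooling approach using plain transformers. It assumes the model is published on the Hugging Face Hub as llmrails/ember-v1; substitute the actual checkpoint name if it differs.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "llmrails/ember-v1"  # assumed Hub id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def average_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Zero out padding positions, then average over the sequence dimension.
    masked = last_hidden_state.masked_fill(~attention_mask.bool().unsqueeze(-1), 0.0)
    return masked.sum(dim=1) / attention_mask.sum(dim=1, keepdim=True)

texts = [
    "ember-v1 produces 1024-dimensional embeddings.",
    "The model handles sequences up to 512 tokens.",
]

batch = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)

embeddings = average_pool(outputs.last_hidden_state, batch["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)  # unit-length vectors for cosine similarity
print(embeddings.shape)  # torch.Size([2, 1024])
```

With sentence-transformers, the pooling and normalization details are handled by the library's model configuration, so loading reduces to a single `SentenceTransformer(MODEL_ID)` call.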
Core Capabilities
- Text Classification (91.98% accuracy on Amazon Polarity)
- Semantic Similarity (87.77% Spearman correlation on STSBenchmark; see the example after this list)
- Information Retrieval (85.51% MAP@10 on Quora Retrieval)
- Clustering (65.54% V-measure on StackExchange)
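As a quick illustration of the semantic similarity capability, here is a minimal check via sentence-transformers, again assuming the llmrails/ember-v1 Hub id. The sentence pairs are illustrative, not drawn from the benchmark.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("llmrails/ember-v1")  # assumed Hub id

sentence_pairs = [
    ("A man is playing a guitar.", "Someone is strumming a guitar."),
    ("A man is playing a guitar.", "The stock market fell sharply today."),
]

for a, b in sentence_pairs:
    # Normalized embeddings make cosine similarity a simple dot product.
    emb = model.encode([a, b], normalize_embeddings=True)
    score = util.cos_sim(emb[0], emb[1]).item()
    print(f"{score:.3f}  {a!r} vs {b!r}")
```

Related sentences should score noticeably higher than unrelated ones; the absolute values depend on the model's similarity distribution.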
Frequently Asked Questions
Q: What makes this model unique?
The model combines training techniques from RetroMAE and SetFit to achieve strong benchmark performance with a relatively compact 335M-parameter architecture. On the MTEB average, it outperforms models such as OpenAI's text-embedding-ada-002.
Q: What are the recommended use cases?
ember-v1 excels at semantic search, document clustering, text classification, and similarity assessment. It is particularly effective for English-language tasks in professional domains such as finance, science, and law. A short semantic-search sketch follows.
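The sketch below shows a basic semantic-search flow with sentence-transformers, under the same llmrails/ember-v1 Hub-id assumption; the corpus and query are made-up examples.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("llmrails/ember-v1")  # assumed Hub id

corpus = [
    "The Federal Reserve raised interest rates by 25 basis points.",
    "CRISPR-Cas9 enables precise edits to genomic DNA.",
    "The court granted summary judgment to the defendant.",
]
# Embed the corpus once; normalized tensors allow fast cosine-based search.
corpus_embeddings = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

query = "What did the central bank decide about rates?"
query_embedding = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

# Retrieve the top-2 most similar corpus entries for the query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```

For larger corpora, the same embeddings can be indexed in an approximate-nearest-neighbor store instead of searched exhaustively.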