GIST-large-Embedding-v0
Property | Value
---|---
Parameter Count | 335M |
License | MIT |
Paper | GISTEmbed Paper |
Base Model | BAAI/bge-large-en-v1.5 |
What is GIST-large-Embedding-v0?
GIST-large-Embedding-v0 is a text embedding model that implements the Guided In-sample Selection of Training Negatives (GIST) approach. Built on top of BAAI/bge-large-en-v1.5, it is fine-tuned on a combination of the MEDI dataset and MTEB Classification training data, and it delivers improved semantic search performance without requiring explicit task instructions.
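As a concrete sketch of this instruction-free usage (assuming the checkpoint is published on the Hugging Face Hub under an identifier such as avsolatorio/GIST-large-Embedding-v0 and loads with the sentence-transformers library, as BGE-derived models typically do), embeddings can be generated directly from raw text:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Assumed Hub identifier; substitute the actual checkpoint path if it differs.
model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0")

# No instruction prefix is prepended to the inputs.
texts = [
    "GISTEmbed fine-tunes embedding models with guided selection of negatives.",
    "The capital of France is Paris.",
]
embeddings = model.encode(texts, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity between the two sentences (a dot product once vectors are normalized).
print(cos_sim(embeddings[0], embeddings[1]))
```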
Implementation Details
The model was trained for 40 epochs with a warmup ratio of 0.1, a learning rate of 5e-6, a contrastive loss temperature of 0.01, and a batch size of 16, with the checkpoint taken at step 171,000. The architecture is unchanged from the 335M-parameter base model, so the gains in semantic quality come from fine-tuning rather than structural changes.
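The temperature is the factor that scales query-passage similarities before the softmax in a standard in-batch contrastive (InfoNCE-style) objective. The sketch below is an illustrative reading of where a value like 0.01 plugs in, using assumed tensor shapes rather than the authors' actual training code:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor,
                     passage_emb: torch.Tensor,
                     temperature: float = 0.01) -> torch.Tensor:
    """In-batch contrastive loss: each query's positive is the passage at the
    same index; all other passages in the batch act as negatives."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature            # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Random embeddings standing in for model outputs
# (batch size 16, 1024-dim vectors as produced by the large base model).
loss = contrastive_loss(torch.randn(16, 1024), torch.randn(16, 1024))
```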
- No instruction requirement for generating embeddings
- Optimized for direct query encoding in retrieval tasks (see the retrieval sketch after this list)
- Fine-tuned on comprehensive dataset combinations
- Implements advanced negative sampling techniques
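For retrieval, the base BGE models recommend prepending an instruction such as "Represent this sentence for searching relevant passages:" to queries; GIST-large-Embedding-v0 is intended to work without it. A minimal retrieval sketch, reusing the same assumed Hub identifier as above:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0")  # assumed Hub id

documents = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "Photosynthesis converts light energy into chemical energy in plants.",
    "The Great Wall of China stretches for thousands of kilometers.",
]
query = "When was the Eiffel Tower built?"

# Encode the query as-is; no instruction prefix is required.
doc_emb = model.encode(documents, convert_to_tensor=True, normalize_embeddings=True)
query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

scores = cos_sim(query_emb, doc_emb)[0]
best = scores.argmax().item()
print(documents[best], scores[best].item())
```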
Core Capabilities
- High-performance semantic similarity matching
- Robust performance across multiple MTEB benchmark tasks
- Efficient text embedding generation
- Strong performance in classification and retrieval tasks
Frequently Asked Questions
Q: What makes this model unique?
The model's uniqueness lies in its ability to generate high-quality embeddings without requiring task-specific instructions, combined with its implementation of the GIST approach for negative sampling during training.
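At a high level, GIST uses a separate guide model to decide which in-batch candidates are safe to treat as negatives, discarding those the guide scores as at least as similar to the query as the true positive (likely false negatives). The sketch below is an illustrative reading of that idea, not the paper's exact procedure:

```python
import torch
import torch.nn.functional as F

def guided_in_batch_loss(q, p, guide_q, guide_p, temperature: float = 0.01):
    """Contrastive loss where a guide model's similarities suppress likely
    false negatives among the in-batch candidates (illustrative sketch)."""
    q, p = F.normalize(q, dim=-1), F.normalize(p, dim=-1)
    gq, gp = F.normalize(guide_q, dim=-1), F.normalize(guide_p, dim=-1)

    logits = q @ p.T / temperature                      # fine-tuned model similarities
    guide_sim = gq @ gp.T                               # guide model similarities
    positive_sim = guide_sim.diagonal().unsqueeze(1)    # guide score of each true pair

    # Mask candidates the guide rates at least as similar as the true positive,
    # except the positive itself; these are treated as false negatives.
    false_negatives = (guide_sim >= positive_sim) & ~torch.eye(
        q.size(0), dtype=torch.bool, device=q.device)
    logits = logits.masked_fill(false_negatives, float("-inf"))

    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```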
Q: What are the recommended use cases?
The model excels in semantic search, document retrieval, text classification, and similarity matching tasks. It's particularly effective for applications requiring robust text embeddings without the need for task-specific prompting.
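For the classification use case, one common pattern is to train a lightweight classifier on frozen embeddings. The sketch below assumes scikit-learn is available and reuses the same hypothetical Hub identifier as above:

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0")  # assumed Hub id

train_texts = ["great product, works perfectly", "terrible, broke after a day",
               "exceeded my expectations", "would not recommend"]
train_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Frozen embeddings act as features for a simple linear classifier.
X_train = model.encode(train_texts, normalize_embeddings=True)
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

X_test = model.encode(["absolutely love it"], normalize_embeddings=True)
print(clf.predict(X_test))  # expected: [1]
```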