GIST-large-Embedding-v0
Property | Value
---|---
Parameter Count | 335M |
License | MIT |
Paper | GISTEmbed Paper |
Base Model | BAAI/bge-large-en-v1.5 |
What is GIST-large-Embedding-v0?
GIST-large-Embedding-v0 is a text embedding model that implements the Guided In-sample Selection of Training Negatives (GIST) approach. Built on top of BAAI/bge-large-en-v1.5, it is fine-tuned on a combination of the MEDI dataset and MTEB Classification training data, and it delivers improved semantic search performance without requiring explicit task instructions.
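As a concrete sketch of this instruction-free usage (assuming the checkpoint is published on the Hugging Face Hub under an identifier such as avsolatorio/GIST-large-Embedding-v0 and loads with the sentence-transformers library, as BGE-derived models typically do), embeddings can be generated directly from raw text:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Assumed Hub identifier; substitute the actual checkpoint path if it differs.
model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0")

# No instruction prefix is prepended to the inputs.
texts = [
    "GISTEmbed fine-tunes embedding models with guided selection of negatives.",
    "The capital of France is Paris.",
]
embeddings = model.encode(texts, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity between the two sentences (a dot product once vectors are normalized).
print(cos_sim(embeddings[0], embeddings[1]))
```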
Implementation Details
The model was trained for 40 epochs with a warmup ratio of 0.1, a learning rate of 5e-6, a contrastive loss temperature of 0.01, and a batch size of 16, with the checkpoint taken at step 171,000. The architecture is unchanged from the 335M-parameter base model, so the gains in semantic quality come from fine-tuning rather than structural changes.
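The temperature is the factor that scales query-passage similarities before the softmax in a standard in-batch contrastive (InfoNCE-style) objective. The sketch below is an illustrative reading of where a value like 0.01 plugs in, using assumed tensor shapes rather than the authors' actual training code:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor,
                     passage_emb: torch.Tensor,
                     temperature: float = 0.01) -> torch.Tensor:
    """In-batch contrastive loss: each query's positive is the passage at the
    same index; all other passages in the batch act as negatives."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature            # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Random embeddings standing in for model outputs
# (batch size 16, 1024-dim vectors as produced by the large base model).
loss = contrastive_loss(torch.randn(16, 1024), torch.randn(16, 1024))
```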
- No instruction requirement for generating embeddings
- Optimized for direct query encoding in retrieval tasks (see the retrieval sketch after this list)
- Fine-tuned on comprehensive dataset combinations
- Implements advanced negative sampling techniques
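For retrieval, the base BGE models recommend prepending an instruction such as "Represent this sentence for searching relevant passages:" to queries; GIST-large-Embedding-v0 is intended to work without it. A minimal retrieval sketch, reusing the same assumed Hub identifier as above:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0")  # assumed Hub id

documents = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "Photosynthesis converts light energy into chemical energy in plants.",
    "The Great Wall of China stretches for thousands of kilometers.",
]
query = "When was the Eiffel Tower built?"

# Encode the query as-is; no instruction prefix is required.
doc_emb = model.encode(documents, convert_to_tensor=True, normalize_embeddings=True)
query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

scores = cos_sim(query_emb, doc_emb)[0]
best = scores.argmax().item()
print(documents[best], scores[best].item())
```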
Core Capabilities
- High-performance semantic similarity matching
- Robust performance across multiple MTEB benchmark tasks
- Efficient text embedding generation
- Strong performance in classification and retrieval tasks
Frequently Asked Questions
Q: What makes this model unique?
The model's uniqueness lies in its ability to generate high-quality embeddings without requiring task-specific instructions, combined with its implementation of the GIST approach for negative sampling during training.
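At a high level, GIST uses a separate guide model to decide which in-batch candidates are safe to treat as negatives, discarding those the guide scores as at least as similar to the query as the true positive (likely false negatives). The sketch below is an illustrative reading of that idea, not the paper's exact procedure:

```python
import torch
import torch.nn.functional as F

def guided_in_batch_loss(q, p, guide_q, guide_p, temperature: float = 0.01):
    """Contrastive loss where a guide model's similarities suppress likely
    false negatives among the in-batch candidates (illustrative sketch)."""
    q, p = F.normalize(q, dim=-1), F.normalize(p, dim=-1)
    gq, gp = F.normalize(guide_q, dim=-1), F.normalize(guide_p, dim=-1)

    logits = q @ p.T / temperature                      # fine-tuned model similarities
    guide_sim = gq @ gp.T                               # guide model similarities
    positive_sim = guide_sim.diagonal().unsqueeze(1)    # guide score of each true pair

    # Mask candidates the guide rates at least as similar as the true positive,
    # except the positive itself; these are treated as false negatives.
    false_negatives = (guide_sim >= positive_sim) & ~torch.eye(
        q.size(0), dtype=torch.bool, device=q.device)
    logits = logits.masked_fill(false_negatives, float("-inf"))

    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```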
Q: What are the recommended use cases?
The model excels in semantic search, document retrieval, text classification, and similarity matching tasks. It's particularly effective for applications requiring robust text embeddings without the need for task-specific prompting.
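For the classification use case, one common pattern is to train a lightweight classifier on frozen embeddings. The sketch below assumes scikit-learn is available and reuses the same hypothetical Hub identifier as above:

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer("avsolatorio/GIST-large-Embedding-v0")  # assumed Hub id

train_texts = ["great product, works perfectly", "terrible, broke after a day",
               "exceeded my expectations", "would not recommend"]
train_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Frozen embeddings act as features for a simple linear classifier.
X_train = model.encode(train_texts, normalize_embeddings=True)
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

X_test = model.encode(["absolutely love it"], normalize_embeddings=True)
print(clf.predict(X_test))  # expected: [1]
```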