GTE-base-en-v1.5
| Property | Value |
|---|---|
| Parameter Count | 137M |
| Model Type | Text Embeddings |
| Architecture | Transformer++ (BERT + RoPE + GLU) |
| Max Sequence Length | 8192 tokens |
| Embedding Dimension | 768 |
| License | Apache 2.0 |
| Paper | mGTE Paper |
What is gte-base-en-v1.5?
GTE-base-en-v1.5 is a state-of-the-art English text embedding model designed for long-context text representation. Built on the Transformer++ architecture (a BERT-style encoder augmented with RoPE and GLU), it achieves strong performance on the MTEB benchmark while supporting sequences up to 8192 tokens. RoPE (Rotary Position Embedding) encodes token positions as rotations of query and key vectors, and GLU (Gated Linear Units) replaces the standard feed-forward activation with a gated variant.
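To make the RoPE component concrete, here is a minimal NumPy sketch (illustrative only, not the model's actual implementation): pairs of vector components are rotated by position-dependent angles, so the dot product between a rotated query and key depends only on their relative distance.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embedding to x of shape (seq_len, dim).

    The vector is split into two halves; each pair (x1[i], x2[i]) is
    rotated by positions * base**(-i/half) radians.
    """
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair frequencies
    angles = positions[:, None] * freqs[None, :]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Because each pair undergoes a pure 2-D rotation, vector norms are preserved, and the similarity between a query rotated to position m and a key rotated to position n depends only on n - m — the relative-position property that supports long-sequence modeling.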
Implementation Details
The model underwent a multi-stage training process: masked language modeling (MLM) on c4-en data, weakly supervised contrastive pre-training, and supervised contrastive fine-tuning. It achieves an average score of 64.11 on the MTEB benchmark, performing well across tasks including classification, clustering, and semantic textual similarity.
- Supports context lengths up to 8192 tokens
- Implements an efficient Transformer++ architecture with RoPE and GLU
- Trained using multi-stage strategy including MLM and contrastive learning
- Achieves SOTA performance within its size category
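The contrastive stages optimize an InfoNCE-style objective: each query should score its paired passage higher than the other passages in the batch. Below is a minimal NumPy sketch of an in-batch contrastive loss (illustrative; the temperature value and other details are assumptions, not the model's exact training recipe):

```python
import numpy as np

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    """In-batch contrastive loss: each query's positive is the doc at the same index."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = q @ d.T / temperature               # (batch, batch) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # cross-entropy on the diagonal
```

Every other document in the batch serves as an in-batch negative, which is what makes large-batch contrastive pre-training effective without explicit hard-negative mining.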
Core Capabilities
- High-quality text embeddings for semantic search and retrieval
- Strong performance on classification tasks (77.17 average score)
- Excellent clustering capabilities (46.82 v-measure)
- Robust semantic textual similarity (81.97 average score)
- Efficient long-context processing
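In a retrieval setting, documents are embedded once and queries are matched by cosine similarity over those vectors. A minimal sketch of the ranking step, assuming embeddings have already been produced (the toy vectors in the example are stand-ins, not real model outputs):

```python
import numpy as np

def top_k(query_vec, doc_matrix, k=3):
    """Return indices and cosine scores of the k documents closest to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                  # cosine similarity per document
    idx = np.argsort(-scores)[:k]   # best-first
    return idx, scores[idx]
```

With the real model, `doc_matrix` would hold 768-dimensional embeddings, one row per document of up to 8192 tokens.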
Frequently Asked Questions
Q: What makes this model unique?
The model's ability to handle 8192 token sequences while maintaining SOTA performance, combined with its efficient architecture and multi-stage training approach, sets it apart from other embedding models in its category.
Q: What are the recommended use cases?
The model excels in semantic search, document retrieval, text classification, clustering, and similarity comparison tasks. It's particularly suitable for applications requiring long text processing and high-quality semantic representations.