GTE-Large: General Text Embeddings Model
| Property | Value |
|---|---|
| Parameter Count | 335M |
| Embedding Dimension | 1024 |
| Max Sequence Length | 512 tokens |
| License | MIT |
| Paper | arXiv:2308.03281 |
What is GTE-Large?
GTE-Large is a general-purpose text embedding model developed by Alibaba DAMO Academy. It is the largest variant in the GTE family and is trained to produce high-quality text embeddings through multi-stage contrastive learning. The model achieves an average score of 63.13 on the MTEB benchmark, outperforming popular alternatives such as E5-large-v2 and OpenAI's text-embedding-ada-002.
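The exact multi-stage training recipe is described in arXiv:2308.03281. As a rough orientation only, the sketch below shows the in-batch-negatives InfoNCE loss that this family of contrastive objectives builds on; the function name and temperature value are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  doc_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch-negatives contrastive loss: each query's positive is the
    document at the same batch index; every other document in the batch
    serves as a negative. Temperature is an illustrative choice."""
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    # Cosine-similarity logits between every query and every document.
    logits = query_emb @ doc_emb.T / temperature
    # The matching (positive) pair sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```

Training on this kind of objective pulls embeddings of related text pairs together and pushes unrelated pairs apart, which is what makes the resulting vectors useful for similarity and retrieval.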
Implementation Details
Built on the BERT architecture, GTE-Large generates 1024-dimensional embeddings and can process sequences up to 512 tokens in length. The model is trained on a diverse corpus of relevance text pairs, enabling robust performance across various domains.
- Advanced multi-stage contrastive learning approach
- Optimized for both semantic similarity and information retrieval tasks
- Supports batch processing with optional embedding normalization
- Implements efficient average pooling for token aggregation (see the usage sketch after this list)
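The following is a minimal usage sketch of the standard mean-pooling pattern for BERT-style embedders, covering the points above: tokenize, run the encoder, average-pool the token states under the attention mask, and optionally L2-normalize. The Hub ID `thenlper/gte-large` and the example sentences are assumptions for illustration.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

def average_pool(last_hidden_states: torch.Tensor,
                 attention_mask: torch.Tensor) -> torch.Tensor:
    # Zero out padding positions, then average over the sequence dimension.
    hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large")
model = AutoModel.from_pretrained("thenlper/gte-large")

texts = ["what is the capital of China?",
         "Beijing is the capital of China."]
batch = tokenizer(texts, max_length=512, padding=True,
                  truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)

embeddings = average_pool(outputs.last_hidden_state, batch["attention_mask"])
# Optional L2 normalization so dot products equal cosine similarities.
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)              # torch.Size([2, 1024])
print(embeddings[0] @ embeddings[1]) # cosine similarity of the pair
```

Normalizing the embeddings makes a plain dot product equivalent to cosine similarity, which simplifies downstream scoring.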
Core Capabilities
- Information Retrieval (52.22 MTEB average)
- Semantic Textual Similarity (83.35 MTEB average)
- Text Reranking (59.13 MTEB average)
- Clustering (46.84 MTEB average)
- Classification (73.33 MTEB average)
Frequently Asked Questions
Q: What makes this model unique?
GTE-Large combines a BERT-large-scale encoder with multi-stage contrastive learning, delivering strong benchmark performance while remaining far smaller than LLM-based embedding models. It is particularly strong on semantic similarity tasks and offers a good balance between model size and embedding quality.
Q: What are the recommended use cases?
The model is well suited to text similarity comparison, document retrieval, semantic search, and content recommendation systems. It is especially attractive for applications that need high-quality embeddings at moderate computational cost; a minimal retrieval sketch follows below.
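As one concrete example of the retrieval use case, here is a sketch of semantic search via the sentence-transformers library, which can load the checkpoint directly. The corpus, query, and `top_k` choice are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer, util

# Assumes the sentence-transformers package and the thenlper/gte-large checkpoint.
model = SentenceTransformer("thenlper/gte-large")

corpus = [
    "GTE-Large produces 1024-dimensional embeddings.",
    "The Great Wall of China stretches for thousands of kilometers.",
    "Contrastive learning pulls related text pairs together.",
]
corpus_emb = model.encode(corpus, normalize_embeddings=True, convert_to_tensor=True)

query = "How are text embedding models trained?"
query_emb = model.encode(query, normalize_embeddings=True, convert_to_tensor=True)

# Cosine similarity ranks corpus entries against the query.
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```

For larger corpora, the same embeddings can be indexed in an approximate-nearest-neighbor store instead of being compared exhaustively.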