# GTE-Base Text Embedding Model
| Property | Value |
|---|---|
| Parameter Count | 109M |
| Embedding Dimension | 768 |
| Max Sequence Length | 512 tokens |
| License | MIT |
| Paper | arXiv:2308.03281 |
## What is gte-base?
GTE-base (General Text Embeddings) is a medium-sized text embedding model developed by Alibaba DAMO Academy. It offers a balanced trade-off between model size and performance, achieving an average score of 62.39 across the 56 tasks of the MTEB benchmark. The model is designed to generate high-quality text embeddings for a wide range of natural language processing tasks.
## Implementation Details
Built on the BERT architecture, GTE-base produces 768-dimensional embeddings and can process sequences up to 512 tokens in length. The model was trained using a multi-stage contrastive learning approach on a diverse dataset of relevance text pairs, enabling robust semantic understanding across different domains.
- Efficient architecture with 109M parameters
- Strong MTEB scores in clustering (46.2), pair classification (84.57), and semantic textual similarity (82.3)
- Optimized for both accuracy and computational efficiency
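GTE-style encoders typically derive a single sentence embedding by mean-pooling the token embeddings produced by the BERT backbone, ignoring padding positions. The following is a minimal numpy sketch of masked mean pooling; the function name and shapes are illustrative, not taken from the official implementation.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Masked mean pooling: average the token vectors, skipping padding.

    token_embeddings: (seq_len, dim) per-token vectors from the encoder
    attention_mask:   (seq_len,) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(token_embeddings.dtype)  # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)                 # sum real tokens only
    count = np.clip(mask.sum(), 1e-9, None)                        # avoid division by zero
    return summed / count

# Toy example: 4 token slots, 768-dim vectors, last slot is padding.
tokens = np.ones((4, 768))
tokens[3] = 100.0                  # padding content that must be ignored
mask = np.array([1, 1, 1, 0])
emb = mean_pool(tokens, mask)      # a single 768-dim sentence embedding
```

In practice the `(seq_len, dim)` matrix would come from the model's last hidden state rather than being constructed by hand.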
## Core Capabilities
- Information Retrieval and Document Search
- Semantic Textual Similarity Assessment
- Text Reranking and Classification
- English-Language Text Understanding (the model is not multilingual)
- Efficient Text Embedding Generation
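Semantic textual similarity with embedding models like gte-base is typically scored as the cosine similarity between the two sentence embeddings. A self-contained sketch of that scoring step, using small hand-made vectors in place of real 768-dim embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in vectors; real inputs would be gte-base sentence embeddings.
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 1.0])
c = np.array([0.0, 1.0, 0.0])

print(cosine_similarity(a, b))  # identical direction -> 1.0
print(cosine_similarity(a, c))  # orthogonal -> 0.0
```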
## Frequently Asked Questions
**Q: What makes this model unique?**
GTE-base stands out for its excellent performance-to-size ratio, achieving comparable results to larger models while being more resource-efficient. It ranks highly on the MTEB leaderboard and provides a strong balance between computational requirements and embedding quality.
**Q: What are the recommended use cases?**
The model excels in information retrieval, semantic similarity tasks, and text classification. It's particularly well-suited for applications requiring efficient text embeddings without compromising on quality, such as search systems, recommendation engines, and document similarity analysis.
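A typical retrieval setup embeds the document collection once, then ranks documents by cosine similarity to the query embedding. The sketch below uses random stand-in vectors; in real use the embeddings would come from gte-base (for example, loading the `thenlper/gte-base` checkpoint with the `sentence-transformers` library, which is an assumption about your tooling, not part of this card).

```python
import numpy as np

def rank_documents(query_emb: np.ndarray, doc_embs: np.ndarray):
    """Return document indices sorted by cosine similarity to the query (best first)."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    return np.argsort(-scores), scores  # descending order of similarity

# Stand-in 768-dim embeddings; real ones would come from the model.
rng = np.random.default_rng(0)
query = rng.normal(size=768)
docs = np.stack([
    query + 0.1 * rng.normal(size=768),  # near-duplicate of the query
    rng.normal(size=768),                # unrelated
    rng.normal(size=768),                # unrelated
])
order, scores = rank_documents(query, docs)
```

The near-duplicate document ranks first, since its embedding points in almost the same direction as the query's.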