# GTE-Small

| Property | Value |
|---|---|
| Parameter Count | 33.4M |
| Embedding Dimension | 384 |
| Max Sequence Length | 512 |
| License | MIT |
| Paper | arXiv:2308.03281 |
## What is gte-small?
GTE-Small is a lightweight text embedding model developed by Alibaba DAMO Academy, designed to generate high-quality text embeddings while maintaining computational efficiency. As part of the GTE (General Text Embeddings) family, it achieves remarkable performance with just 33.4M parameters, making it particularly suitable for resource-constrained environments.
## Implementation Details
The model leverages a BERT-based architecture with a 384-dimensional embedding space and supports sequences up to 512 tokens. It employs multi-stage contrastive learning trained on diverse relevance text pairs, enabling strong performance across various text similarity tasks.
- Achieves 61.36 average score on MTEB benchmark
- Optimized for English language processing
- Supports both similarity scoring and embedding generation
- Implements efficient average pooling for embedding calculation (see the usage sketch after this list)
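
As a rough usage sketch (not taken from this card): the Hugging Face model id `thenlper/gte-small` is assumed, along with `transformers` and `torch` being installed. The `average_pool` helper mirrors the average-pooling step described above, and the final dot product is a cosine similarity because the embeddings are L2-normalized first.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer


def average_pool(last_hidden_states: torch.Tensor,
                 attention_mask: torch.Tensor) -> torch.Tensor:
    # Zero out padding positions, then average over the sequence dimension.
    hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Assumed model id; adjust if your copy of the weights lives elsewhere.
tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-small")
model = AutoModel.from_pretrained("thenlper/gte-small")

texts = ["what is the capital of China?", "Beijing is the capital of China."]
batch = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

# 384-dimensional sentence embeddings, normalized to unit length.
embeddings = average_pool(outputs.last_hidden_state, batch["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)

similarity = embeddings[0] @ embeddings[1]  # cosine similarity between the two texts
print(float(similarity))
```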
## Core Capabilities
- Information Retrieval (see the retrieval sketch after this list)
- Semantic Textual Similarity
- Text Reranking
- Clustering Applications
- Classification Tasks
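
A minimal retrieval sketch, assuming `sentence-transformers` is installed and again using the assumed model id `thenlper/gte-small`; the corpus and query are toy data for illustration only.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thenlper/gte-small")  # assumed model id

corpus = [
    "GTE models are trained with multi-stage contrastive learning.",
    "Average pooling turns token states into a single sentence vector.",
    "The capital of France is Paris.",
]
query = "How are GTE embeddings trained?"

# Encode corpus and query into normalized 384-d vectors.
corpus_emb = model.encode(corpus, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

# Rank corpus entries by cosine similarity to the query.
scores = util.cos_sim(query_emb, corpus_emb)[0]
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```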
## Frequently Asked Questions
Q: What makes this model unique?
GTE-Small offers an excellent balance between model size and performance, achieving results on par with much larger models while being significantly more compact. It maintains a 61.36 average MTEB score despite a footprint of only about 70MB.
Q: What are the recommended use cases?
The model excels in tasks requiring semantic understanding such as document similarity, information retrieval, and text classification. It's particularly well-suited for applications where computational resources are limited but high-quality embeddings are needed.
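
For the classification use case mentioned above, one common pattern is to use the embeddings as fixed features for a lightweight classifier. The sketch below assumes `sentence-transformers` and `scikit-learn` are installed and uses toy review data; it is an illustration, not a recommended training setup.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer("thenlper/gte-small")  # assumed model id

# Toy sentiment data: 1 = positive review, 0 = negative review.
train_texts = [
    "great battery life",
    "screen cracked after a week",
    "fast shipping, works perfectly",
    "stopped charging after two days",
]
train_labels = [1, 0, 1, 0]

# Embeddings serve as fixed features; only the linear classifier is trained.
X_train = model.encode(train_texts, normalize_embeddings=True)
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

X_test = model.encode(["battery died quickly"], normalize_embeddings=True)
print(clf.predict(X_test))  # expected: [0]
```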