GTE-base-zh
Property | Value |
---|---|
Parameter Count | 102M |
Max Sequence Length | 512 tokens |
Embedding Dimension | 768 |
License | MIT |
Paper | Link to paper |
What is GTE-base-zh?
GTE-base-zh is a Chinese text embedding model developed by Alibaba DAMO Academy as part of its General Text Embeddings (GTE) series. It is built on the BERT architecture and is designed to produce high-quality embeddings for Chinese-language text. With 102M parameters and a 768-dimensional embedding space, it balances computational cost against embedding quality.
Implementation Details
The model accepts input sequences of up to 512 tokens and maps each one to a fixed-size 768-dimensional embedding that can be used in downstream tasks. It is trained with multi-stage contrastive learning on a diverse corpus of relevance text pairs, which supports robust semantic understanding.
- Achieves an average score of 65.92 across the 35 CMTEB benchmark datasets
- Optimized for both efficiency and performance with 768-dimensional embeddings
- Implements advanced normalization techniques for improved representation quality
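Text longer than the 512-token limit has to be truncated or split before it is encoded. The sketch below shows one possible way to split a token sequence into overlapping windows. It assumes a BERT-style tokenizer that adds two special tokens ([CLS] and [SEP]) to every sequence; the function name `split_into_windows` and the `overlap` default are illustrative choices, not part of the model's API.

```python
from typing import List, Sequence

MAX_SEQUENCE_LENGTH = 512  # limit stated in the property table
SPECIAL_TOKENS = 2         # assumption: [CLS] and [SEP], as in BERT-style tokenizers


def split_into_windows(
    token_ids: Sequence[int],
    max_length: int = MAX_SEQUENCE_LENGTH,
    overlap: int = 32,  # hypothetical default; tune for your data
) -> List[List[int]]:
    """Split token ids into overlapping windows that fit the length limit."""
    budget = max_length - SPECIAL_TOKENS  # content tokens per window
    if budget <= 0:
        raise ValueError("max_length must leave room for content tokens")
    if not 0 <= overlap < budget:
        raise ValueError("overlap must be non-negative and smaller than the window")
    if len(token_ids) == 0:
        return []

    step = budget - overlap
    windows: List[List[int]] = []
    for start in range(0, len(token_ids), step):
        windows.append(list(token_ids[start:start + budget]))
        if start + budget >= len(token_ids):
            break  # the last window already reaches the end of the input
    return windows
```

In practice the token ids would come from the model's tokenizer with special tokens disabled, and each window would be embedded separately; how the per-window embeddings are combined (averaging, taking the best match, and so on) is an application-level decision.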
Core Capabilities
- Information Retrieval (MAP@100: 70-80% across various datasets)
- Semantic Textual Similarity (Strong performance on STS tasks)
- Text Classification (71.26% average accuracy)
- Text Reranking (67.00% average performance)
- Clustering (53.86% V-measure score)
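For readers unfamiliar with the retrieval metric quoted above, the following sketch computes MAP@k under one common definition: for each query, precision is accumulated at every rank where a relevant document appears within the top k, then divided by the total number of relevant documents, and the result is averaged over queries. The source does not say which variant was used for the reported figures, so treat this as an illustration of the idea rather than the exact evaluation code; all function and parameter names are hypothetical.

```python
from typing import Dict, Sequence, Set


def average_precision_at_k(
    ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 100
) -> float:
    """Average precision over the top-k results for a single query.

    Assumes ranked_ids contains no duplicates.
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant_ids)


def mean_average_precision_at_k(
    rankings: Dict[str, Sequence[str]],
    relevance: Dict[str, Set[str]],
    k: int = 100,
) -> float:
    """Mean of per-query average precision; queries are keyed by id."""
    if not rankings:
        return 0.0
    scores = [
        average_precision_at_k(ranked, relevance.get(query_id, set()), k)
        for query_id, ranked in rankings.items()
    ]
    return sum(scores) / len(scores)
```

For example, a ranking `["a", "b", "c", "d"]` with relevant documents `{"a", "c"}` gives precision 1/1 at rank 1 and 2/3 at rank 3, so the average precision is (1 + 2/3) / 2 = 5/6.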
Frequently Asked Questions
Q: What makes this model unique?
GTE-base-zh offers strong results across retrieval, semantic similarity, classification, reranking, and clustering tasks at a moderate size (102M parameters, 768-dimensional embeddings). It is trained specifically for Chinese text, which makes it a practical choice for Chinese-language applications that cannot afford a much larger model.
Q: What are the recommended use cases?
The model is well suited to semantic search, document similarity comparison, content recommendation, and automated text classification in Chinese-language contexts. It is a good fit for production environments where inference cost has to be weighed against accuracy.
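As an illustration of the semantic search use case, here is a minimal in-memory sketch that ranks documents by cosine similarity to a query. It is not an official usage recipe: the `EmbeddingIndex` class and `load_gte_encoder` helper are hypothetical names, and the loader assumes that the `sentence-transformers` package is installed and that the checkpoint is available on the Hugging Face Hub under the id `thenlper/gte-base-zh`.

```python
import heapq
import math
from typing import Callable, List, Sequence, Tuple

# An encoder maps a list of texts to one embedding vector per text.
Encoder = Callable[[List[str]], List[List[float]]]


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


class EmbeddingIndex:
    """Tiny brute-force index; fine for small corpora, not for millions of documents."""

    def __init__(self, encode: Encoder) -> None:
        self._encode = encode
        self._entries: List[Tuple[str, List[float]]] = []

    def add(self, documents: Sequence[str]) -> None:
        docs = list(documents)
        if not docs:
            return
        for doc, vector in zip(docs, self._encode(docs)):
            self._entries.append((doc, l2_normalize(vector)))

    def search(self, query: str, top_k: int = 3) -> List[Tuple[str, float]]:
        query_vec = l2_normalize(self._encode([query])[0])
        scored = [
            (sum(q * d for q, d in zip(query_vec, vec)), doc)
            for doc, vec in self._entries
        ]
        best = heapq.nlargest(top_k, scored, key=lambda pair: pair[0])
        return [(doc, score) for score, doc in best]


def load_gte_encoder(model_id: str = "thenlper/gte-base-zh") -> Encoder:
    """Wrap the model as an Encoder (assumed Hub id; downloads weights on first use)."""
    from sentence_transformers import SentenceTransformer  # imported lazily

    model = SentenceTransformer(model_id)

    def encode(texts: List[str]) -> List[List[float]]:
        return model.encode(list(texts)).tolist()

    return encode


# Example usage (requires network access to download the model):
# index = EmbeddingIndex(load_gte_encoder())
# index.add(["北京是中国的首都", "苹果是一种水果", "上海是一座大城市"])
# print(index.search("中国的首都是哪里?", top_k=2))
```

Keeping the encoder as a plain callable means the index can be exercised with a stub encoder in tests, and the brute-force scan can later be replaced by a vector database without changing the calling code.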