# GTE-large-zh
| Property | Value |
|---|---|
| Parameters | 326M |
| Maximum Sequence Length | 512 tokens |
| Embedding Dimension | 1024 |
| License | MIT |
| Paper | Research Paper |
## What is gte-large-zh?
GTE-large-zh is a Chinese text embedding model developed by Alibaba DAMO Academy. It is designed to generate high-quality embeddings for Chinese-language content across a range of NLP tasks. At the time of its release, the model led the CMTEB benchmark with an average score of 66.72 across 35 datasets, ahead of other popular Chinese embedding models.
## Implementation Details
Built on the BERT architecture, GTE-large-zh is trained with multi-stage contrastive learning on a diverse corpus of relevance text pairs. It produces 1024-dimensional embeddings and processes sequences of up to 512 tokens (a minimal usage sketch follows the figures below). Its CMTEB results break down by task category as follows:
- Classification: 71.34
- Clustering: 53.07
- Pair Classification: 81.14
- Reranking: 67.42
- Retrieval: 72.49
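If the model is used through the sentence-transformers library, generating embeddings could look like the sketch below. Note that the Hugging Face model ID `thenlper/gte-large-zh` is an assumption based on the public release and is not stated in this card.

```python
# Minimal sketch, assuming the model is published on Hugging Face as
# "thenlper/gte-large-zh" and loadable via sentence-transformers.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("thenlper/gte-large-zh")  # 326M params, 1024-dim output

sentences = [
    "中国的首都是哪里",  # "What is the capital of China?"
    "北京是中国的首都",  # "Beijing is the capital of China."
]
# Inputs longer than the 512-token maximum sequence length are truncated.
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)                       # (2, 1024)
print(cos_sim(embeddings[0], embeddings[1]))  # high cosine similarity expected
```

Normalizing the embeddings makes cosine similarity equivalent to a dot product, which simplifies downstream indexing.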
## Core Capabilities
- Information Retrieval (see the sketch after this list)
- Semantic Textual Similarity
- Text Reranking
- Document Classification
- Clustering Applications
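As a rough illustration of the retrieval and reranking capabilities, the sketch below embeds a small corpus once and ranks it against a query by cosine similarity; the toy corpus and model ID are illustrative assumptions.

```python
# Hedged retrieval sketch: rank a toy corpus against a query by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thenlper/gte-large-zh")  # assumed model ID

corpus = [
    "北京是中国的首都，也是政治和文化中心。",  # "Beijing is China's capital and its political and cultural center."
    "上海是中国最大的经济中心城市。",          # "Shanghai is China's largest economic hub."
    "熊猫主要生活在四川的山区。",              # "Pandas mainly live in the mountains of Sichuan."
]
query = "中国的首都在哪里？"                   # "Where is the capital of China?"

corpus_emb = model.encode(corpus, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

# util.semantic_search returns the top-k corpus entries for each query.
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.4f}  {corpus[hit['corpus_id']]}")
```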
## Frequently Asked Questions
**Q: What makes this model unique?**
GTE-large-zh stands out for its strong CMTEB results relative to its size: at release it outperformed other popular Chinese embedding models on the benchmark while staying relatively compact at 326M parameters. It also delivers balanced performance across the benchmark's task categories, rather than excelling at only one, while being purpose-built for Chinese text.
**Q: What are the recommended use cases?**
The model is well suited to applications that require semantic understanding of Chinese text, including search systems, recommendation engines, document similarity analysis, and content classification, and it scales to enterprise workloads that demand high-quality embeddings.
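As one hedged example of the classification and clustering use cases, the embeddings can feed a standard clustering algorithm directly; the toy corpus and the scikit-learn KMeans setup below are illustrative assumptions, not part of the model's tooling.

```python
# Hedged clustering sketch: group documents by topic using their embeddings.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("thenlper/gte-large-zh")  # assumed model ID

docs = [
    "今天股市大幅上涨。",      # "The stock market rose sharply today."
    "央行宣布下调利率。",      # "The central bank announced a rate cut."
    "国家队在小组赛中获胜。",  # "The national team won its group match."
    "球员在比赛中打进两球。",  # "The player scored two goals in the match."
]
embeddings = model.encode(docs, normalize_embeddings=True)

# Two clusters expected for this toy corpus: finance vs. sports.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for label, doc in sorted(zip(labels, docs)):
    print(label, doc)
```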