# GTE-large-zh
| Property | Value |
|---|---|
| Parameters | 326M |
| Maximum Sequence Length | 512 tokens |
| Embedding Dimension | 1024 |
| License | MIT |
| Paper | Research Paper |
## What is gte-large-zh?
GTE-large-zh is a Chinese text embedding model developed by Alibaba DAMO Academy. It is designed to generate high-quality embeddings for Chinese-language content across a range of NLP tasks. At the time of its release, the model led the CMTEB benchmark with an average score of 66.72 across 35 datasets, ahead of other popular Chinese embedding models.
## Implementation Details
Built on the BERT architecture, GTE-large-zh is trained with multi-stage contrastive learning on a diverse corpus of relevance text pairs. It produces 1024-dimensional embeddings and processes sequences of up to 512 tokens (a minimal usage sketch follows the figures below). Its CMTEB results break down by task category as follows:
- Classification: 71.34
- Clustering: 53.07
- Pair Classification: 81.14
- Reranking: 67.42
- Retrieval: 72.49
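If the model is used through the sentence-transformers library, generating embeddings could look like the sketch below. Note that the Hugging Face model ID `thenlper/gte-large-zh` is an assumption based on the public release and is not stated in this card.

```python
# Minimal sketch, assuming the model is published on Hugging Face as
# "thenlper/gte-large-zh" and loadable via sentence-transformers.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("thenlper/gte-large-zh")  # 326M params, 1024-dim output

sentences = [
    "中国的首都是哪里",  # "What is the capital of China?"
    "北京是中国的首都",  # "Beijing is the capital of China."
]
# Inputs longer than the 512-token maximum sequence length are truncated.
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)                       # (2, 1024)
print(cos_sim(embeddings[0], embeddings[1]))  # high cosine similarity expected
```

Normalizing the embeddings makes cosine similarity equivalent to a dot product, which simplifies downstream indexing.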
## Core Capabilities
- Information Retrieval (see the sketch after this list)
- Semantic Textual Similarity
- Text Reranking
- Document Classification
- Clustering Applications
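As a rough illustration of the retrieval and reranking capabilities, the sketch below embeds a small corpus once and ranks it against a query by cosine similarity; the toy corpus and model ID are illustrative assumptions.

```python
# Hedged retrieval sketch: rank a toy corpus against a query by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thenlper/gte-large-zh")  # assumed model ID

corpus = [
    "北京是中国的首都，也是政治和文化中心。",  # "Beijing is China's capital and its political and cultural center."
    "上海是中国最大的经济中心城市。",          # "Shanghai is China's largest economic hub."
    "熊猫主要生活在四川的山区。",              # "Pandas mainly live in the mountains of Sichuan."
]
query = "中国的首都在哪里？"                   # "Where is the capital of China?"

corpus_emb = model.encode(corpus, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

# util.semantic_search returns the top-k corpus entries for each query.
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.4f}  {corpus[hit['corpus_id']]}")
```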
## Frequently Asked Questions
**Q: What makes this model unique?**
GTE-large-zh stands out for its strong CMTEB results relative to its size: at release it outperformed other popular Chinese embedding models on the benchmark while staying relatively compact at 326M parameters. It also delivers balanced performance across the benchmark's task categories, rather than excelling at only one, while being purpose-built for Chinese text.
**Q: What are the recommended use cases?**
The model is well suited to applications that require semantic understanding of Chinese text, including search systems, recommendation engines, document similarity analysis, and content classification, and it scales to enterprise workloads that demand high-quality embeddings.
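As one hedged example of the classification and clustering use cases, the embeddings can feed a standard clustering algorithm directly; the toy corpus and the scikit-learn KMeans setup below are illustrative assumptions, not part of the model's tooling.

```python
# Hedged clustering sketch: group documents by topic using their embeddings.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("thenlper/gte-large-zh")  # assumed model ID

docs = [
    "今天股市大幅上涨。",      # "The stock market rose sharply today."
    "央行宣布下调利率。",      # "The central bank announced a rate cut."
    "国家队在小组赛中获胜。",  # "The national team won its group match."
    "球员在比赛中打进两球。",  # "The player scored two goals in the match."
]
embeddings = model.encode(docs, normalize_embeddings=True)

# Two clusters expected for this toy corpus: finance vs. sports.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for label, doc in sorted(zip(labels, docs)):
    print(label, doc)
```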