# piccolo-large-zh
| Property | Value |
|---|---|
| Model Size | 0.65 GB |
| Embedding Dimension | 1024 |
| Max Sequence Length | 512 |
| License | MIT |
## What is piccolo-large-zh?
piccolo-large-zh is a state-of-the-art Chinese text embedding model developed by SenseTime Research. It is trained in two stages: general pre-training on 400 million weakly supervised text pairs, followed by fine-tuning on 20 million human-labeled pairs. On the CMTEB benchmark it achieves an average score of 64.11 across all 35 evaluation tasks.
## Implementation Details
The model is built on a transformer architecture and trained with a pipeline that combines pair-wise and triplet contrastive learning. The first stage applies a binary contrastive loss with in-batch negatives; the second stage fine-tunes with an improved contrastive loss that incorporates mined hard negatives.
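To make the two stages concrete, here is a minimal PyTorch sketch of an InfoNCE-style contrastive loss: stage one corresponds to calling it with in-batch negatives only, stage two to passing an extra pool of mined hard negatives. This is an illustrative reconstruction under common-practice assumptions (temperature value, L2 normalization), not SenseTime's released training code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb: torch.Tensor,
                     p_emb: torch.Tensor,
                     hard_neg_emb: torch.Tensor | None = None,
                     temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style contrastive loss with in-batch negatives.

    q_emb:        (B, D) query embeddings
    p_emb:        (B, D) positive passage embeddings; row i matches query i
    hard_neg_emb: optional (N, D) mined hard negatives (stage-two setting)
    """
    q = F.normalize(q_emb, dim=-1)
    candidates = F.normalize(p_emb, dim=-1)
    if hard_neg_emb is not None:
        # Hard negatives are extra candidates every query must score low.
        candidates = torch.cat([candidates, F.normalize(hard_neg_emb, dim=-1)])
    logits = q @ candidates.T / temperature        # (B, B) or (B, B + N)
    labels = torch.arange(q.size(0), device=q.device)  # positive sits at column i
    return F.cross_entropy(logits, labels)

# Stage one: in-batch negatives only; stage two adds a hard-negative pool.
q, p = torch.randn(8, 1024), torch.randn(8, 1024)
print(contrastive_loss(q, p))
print(contrastive_loss(q, p, hard_neg_emb=torch.randn(16, 1024)))
```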
- Supports both short-to-short and short-to-long text matching
- Implements efficient memory usage through fp16 and gradient checkpointing
- Utilizes specialized dataset sampling for optimal batch composition
- Incorporates query/passage prefixes for enhanced retrieval performance (see the usage sketch below)
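A minimal retrieval-style usage sketch with sentence-transformers follows. The Hugging Face model ID `sensenova/piccolo-large-zh` and the "查询: " / "结果: " prefixes are taken from the upstream model card and should be treated as assumptions here; verify them against the official repository.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sensenova/piccolo-large-zh")  # assumed model ID

# Retrieval convention: prefix queries with "查询: " and passages with "结果: ".
query = "查询: 北京有哪些著名的旅游景点？"
passages = [
    "结果: 故宫博物院位于北京市中心，是明清两代的皇家宫殿。",
    "结果: 外滩是上海最著名的观光景点之一。",
]

q_emb = model.encode(query, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

# Cosine similarity should rank the Beijing passage first for this query.
print(util.cos_sim(q_emb, p_emb))
```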
## Core Capabilities
- Classification (67.03 average accuracy across 9 tasks)
- Clustering (47.04 average v-measure across 4 tasks)
- Pair Classification (78.38 average precision across 2 tasks)
- Reranking (65.98 MAP across 4 tasks)
- Retrieval (70.93 nDCG@10 across 8 tasks)
- Semantic Textual Similarity (58.02 Spearman correlation across 8 tasks)
## Frequently Asked Questions
Q: What makes this model unique?
Its distinctive features are the two-stage training approach and the asymmetric handling of query/passage pairs, which are truncated to different maximum lengths (64 tokens for queries, 512 for passages). This makes the model particularly effective for retrieval tasks.
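At inference time the same asymmetry can be reproduced by adjusting the truncation length per input type. A sketch with sentence-transformers (the model ID and prefixes are the same assumptions as in the earlier example):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sensenova/piccolo-large-zh")  # assumed model ID

# Queries are short, so truncate aggressively to save compute.
model.max_seq_length = 64
q_emb = model.encode(["查询: 如何在线申请护照？"])

# Passages get the full 512-token window the model supports.
model.max_seq_length = 512
p_emb = model.encode(["结果: 申请护照需携带本人身份证前往出入境管理部门办理。"])
```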
Q: What are the recommended use cases?
The model excels at text similarity matching, information retrieval, and document classification. It is particularly well-suited to Chinese-language applications that require semantic understanding and comparison.
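As an illustration of the classification use case, the embeddings can feed a lightweight downstream classifier. A toy sketch (texts, labels, and the model ID are placeholders; logistic regression is just one reasonable choice):

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer("sensenova/piccolo-large-zh")  # assumed model ID

# Toy sentiment corpus; real applications would use their own documents.
train_texts = ["这部电影非常精彩", "剧情拖沓，令人失望", "演员的表演很出色", "完全是浪费时间"]
train_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# No retrieval prefixes here; they are only recommended for retrieval tasks.
X_train = model.encode(train_texts, normalize_embeddings=True)
clf = LogisticRegression().fit(X_train, train_labels)

X_test = model.encode(["这个故事让我深受感动"], normalize_embeddings=True)
print(clf.predict(X_test))  # likely [1] for this positive review
```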