piccolo-large-zh-v2

Maintained By: sensenova

Model Size: 1.21 GB
Embedding Dimension: 1792
Paper: arXiv:2405.06932
Sequence Length: 512
C-MTEB Score: 70.95 (Current SOTA)

What is piccolo-large-zh-v2?

piccolo-large-zh-v2 is an advanced Chinese embedding model developed by SenseTime Research that leverages multi-task hybrid loss training to achieve state-of-the-art performance on the C-MTEB benchmark. The model builds upon the success of its predecessor by implementing an efficient training approach that combines multiple specialized loss functions for different types of tasks.

Implementation Details

The model employs three distinct loss functions, selected by task type: InfoNCE with in-batch negatives for retrieval/reranking tasks, CoSENT loss for STS/pair-classification tasks, and a modified InfoNCE variant for classification/clustering tasks (a minimal sketch of the first two follows the list below). It is built on the stella-v3.5 architecture and was trained for 2,500 steps on 32 GPUs.

  • Supports flexible embedding dimensions (256 to 1792)
  • Implements Matryoshka Representation Learning (MRL) for dimension adaptability
  • Achieves superior average performance across the 35 C-MTEB evaluation datasets
  • Uses hybrid loss training to optimize for different task types
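
The two main objectives named above are standard contrastive losses, and the PyTorch snippet below is only a minimal sketch of them, not SenseTime's training code: the temperature, scale factor, and batch conventions are illustrative assumptions. In a hybrid setup, each batch would be routed to the loss that matches its task type.

```python
import torch
import torch.nn.functional as F

def info_nce_in_batch(query_emb, doc_emb, temperature=0.05):
    """InfoNCE with in-batch negatives: every other document in the
    batch acts as a negative for a given query."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                      # [batch, batch] similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

def cosent_loss(emb_a, emb_b, pair_labels, scale=20.0):
    """CoSENT loss for STS/pair data: the cosine similarity of every
    positive pair should exceed that of every negative pair."""
    cos = F.cosine_similarity(emb_a, emb_b) * scale     # [batch] scaled similarities
    diff = cos[None, :] - cos[:, None]                  # diff[i, j] = cos_j - cos_i
    mask = pair_labels[:, None] > pair_labels[None, :]  # keep (i positive, j negative)
    diff = diff[mask]
    # log(1 + sum(exp(neg - pos))), written as logsumexp with an implicit zero term
    zero = torch.zeros(1, device=cos.device)
    return torch.logsumexp(torch.cat([zero, diff]), dim=0)

# Toy check with random embeddings (batch of 8, 1792-dim like piccolo-large-zh-v2)
q, d = torch.randn(8, 1792), torch.randn(8, 1792)
labels = torch.randint(0, 2, (8,))
print(info_nce_in_batch(q, d), cosent_loss(q, d, labels))
```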

Core Capabilities

  • State-of-the-art performance on C-MTEB Chinese benchmarks
  • Excellent results in classification (74.59%), clustering (62.17%), and pair classification (90.24%)
  • Robust performance in reranking (70%) and retrieval tasks (74.36%)
  • Flexible dimension reduction while maintaining performance
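
Because the model is trained with MRL, its 1792-dimensional vectors can be truncated to smaller sizes (256 to 1792, per the specs above) and re-normalized. The sketch below uses sentence-transformers; the sensenova/piccolo-large-zh-v2 repo ID and the assumption that no instruction prefix is required should be confirmed against the official Hugging Face model card.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Repo ID assumed from the maintainer and model name; verify before use.
model = SentenceTransformer("sensenova/piccolo-large-zh-v2")

sentences = ["今天天气真好", "今天的天气非常不错"]
full = model.encode(sentences)                     # shape (2, 1792)

# MRL-style truncation: keep the first k dimensions, then re-normalize.
k = 256
trunc = full[:, :k]
trunc = trunc / np.linalg.norm(trunc, axis=1, keepdims=True)

cos_full = full[0] @ full[1] / (np.linalg.norm(full[0]) * np.linalg.norm(full[1]))
cos_trunc = trunc[0] @ trunc[1]
print(f"cosine @ 1792 dims: {cos_full:.4f}, cosine @ {k} dims: {cos_trunc:.4f}")
```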

Frequently Asked Questions

Q: What makes this model unique?

The model's distinctive feature is its multi-task hybrid loss training, which combines loss functions tailored to specific task types, along with its flexible embedding dimensions, which range from 256 to 1792.

Q: What are the recommended use cases?

The model excels in various scenarios including text similarity comparison, document retrieval, classification tasks, and clustering applications. It's particularly effective for Chinese language processing tasks requiring high-quality semantic embeddings.
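
As a concrete illustration of the retrieval use case, the sketch below ranks a small Chinese corpus against a query with sentence-transformers; the repo ID and the absence of a query prefix are, again, assumptions rather than documented requirements.

```python
from sentence_transformers import SentenceTransformer, util
import torch

model = SentenceTransformer("sensenova/piccolo-large-zh-v2")  # repo ID assumed

corpus = [
    "北京是中国的首都。",
    "深度学习模型需要大量的训练数据。",
    "故宫位于北京市中心。",
]
query = "中国的首都是哪座城市?"

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every corpus passage.
scores = util.cos_sim(query_emb, corpus_emb)[0]

# Print passages from most to least similar to the query.
for idx in torch.argsort(scores, descending=True).tolist():
    print(f"{scores[idx].item():.4f}  {corpus[idx]}")
```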
