# text2vec-base-chinese-paraphrase
| Property | Value |
|---|---|
| Parameter Count | 118M |
| License | Apache 2.0 |
| Base Model | nghuyong/ernie-3.0-base-zh |
| Embedding Dimension | 768 |
| Max Sequence Length | 256 |
## What is text2vec-base-chinese-paraphrase?
text2vec-base-chinese-paraphrase is a CoSENT (Cosine Sentence) model designed for Chinese-language semantic matching. Built on ERNIE 3.0 (nghuyong/ernie-3.0-base-zh), it maps sentences to a 768-dimensional dense vector space and achieves state-of-the-art results on semantic similarity tasks across multiple Chinese benchmarks.
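Downstream, similarity between two sentence embeddings is typically scored with cosine similarity. A minimal sketch of that computation, using small numpy vectors as stand-ins for the model's 768-dimensional outputs (the vectors here are illustrative, not real embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two dense embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dim stand-ins for 768-dim sentence embeddings
v1 = np.array([0.1, 0.3, 0.5, 0.7])
v2 = np.array([0.1, 0.3, 0.5, 0.7])
v3 = np.array([0.7, -0.5, 0.3, -0.1])

print(cosine_similarity(v1, v2))  # identical vectors → 1.0
print(cosine_similarity(v1, v3))  # unrelated direction → much lower score
```

Scores near 1.0 indicate semantically similar sentences; lower scores indicate unrelated text.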
## Implementation Details
The model uses the CoSENT architecture with ERNIE 3.0 as its backbone and is trained on carefully curated Chinese STS datasets. It applies mean pooling over token embeddings to produce sentence vectors, and achieves an average score of 63.08 across Chinese text-matching benchmarks including ATEC, BQ, LCQMC, and PAWSX.
- Trained using contrastive learning with cosine similarity objectives
- Optimized for both sentence-to-sentence and sentence-to-paragraph matching
- Supports efficient processing with 3066 QPS (Queries Per Second)
- Enhanced with s2p (sentence to paraphrase) data for improved long text representation
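The mean-pooling step mentioned above averages the encoder's per-token vectors into one sentence vector, skipping padding positions. A minimal numpy sketch (toy shapes stand in for the real sequence length and 768-dimensional hidden size):

```python
import numpy as np

def mean_pooling(token_embeddings, attention_mask):
    """Average token embeddings into a sentence vector, ignoring padding.

    token_embeddings: (seq_len, dim) array of per-token vectors.
    attention_mask:   (seq_len,) array of 1s (real tokens) and 0s (padding).
    """
    mask = attention_mask[:, None].astype(float)        # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)      # sum real tokens only
    count = max(mask.sum(), 1e-9)                       # avoid divide-by-zero
    return summed / count

# Toy example: 3 real tokens + 1 padding token, dim=2
tokens = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]])
mask = np.array([1, 1, 1, 0])
print(mean_pooling(tokens, mask))  # → [3. 4.] (padding row is excluded)
```

Masking before averaging matters: including padding rows would skew every sentence vector toward the padding embedding.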
## Core Capabilities
- Semantic sentence embedding generation
- Text similarity computation
- Paraphrase detection
- Information retrieval tasks
- Text clustering applications
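For the retrieval and search capabilities listed above, the usual pattern is to embed a query and a set of documents, then rank documents by cosine similarity to the query. A small numpy sketch with illustrative vectors (real usage would substitute the model's 768-dimensional embeddings):

```python
import numpy as np

def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices sorted by descending cosine similarity to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    order = np.argsort(-scores)         # best match first
    return order, scores[order]

query = np.array([1.0, 0.0])
docs = np.array([
    [0.0, 1.0],   # orthogonal to the query
    [1.0, 0.0],   # identical direction
    [0.7, 0.7],   # partially similar
])
order, scores = rank_by_similarity(query, docs)
print(order)  # → [1 2 0]
```

The same ranking loop underlies clustering and paraphrase detection: both reduce to comparing embedding vectors with cosine similarity.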
## Frequently Asked Questions
**Q: What makes this model unique?**
This model stands out for its performance on Chinese text-matching tasks, achieving SOTA results through training on both sentence-level and paragraph-level paraphrase data. It is specifically optimized for practical applications that require semantic understanding of Chinese text.
**Q: What are the recommended use cases?**
The model is ideal for Chinese language applications requiring semantic similarity matching, particularly in scenarios involving sentence-to-paragraph comparisons, document retrieval, and semantic search. It's especially effective for tasks requiring understanding of longer text passages.