# text2vec-base-chinese-paraphrase
| Property | Value |
|---|---|
| Parameter Count | 118M |
| License | Apache 2.0 |
| Base Model | nghuyong/ernie-3.0-base-zh |
| Embedding Dimension | 768 |
| Max Sequence Length | 256 |
## What is text2vec-base-chinese-paraphrase?
text2vec-base-chinese-paraphrase is a CoSENT (Cosine Sentence) model designed for Chinese-language semantic matching. Built on ERNIE 3.0 (nghuyong/ernie-3.0-base-zh), it maps sentences to a 768-dimensional dense vector space and achieves state-of-the-art results on semantic similarity tasks across multiple Chinese benchmarks.
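Downstream, similarity between two sentence embeddings is typically scored with cosine similarity. A minimal sketch of that computation, using small numpy vectors as stand-ins for the model's 768-dimensional outputs (the vectors here are illustrative, not real embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two dense embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dim stand-ins for 768-dim sentence embeddings
v1 = np.array([0.1, 0.3, 0.5, 0.7])
v2 = np.array([0.1, 0.3, 0.5, 0.7])
v3 = np.array([0.7, -0.5, 0.3, -0.1])

print(cosine_similarity(v1, v2))  # identical vectors → 1.0
print(cosine_similarity(v1, v3))  # unrelated direction → much lower score
```

Scores near 1.0 indicate semantically similar sentences; lower scores indicate unrelated text.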
## Implementation Details
The model uses the CoSENT architecture with ERNIE 3.0 as its backbone and is trained on carefully curated Chinese STS datasets. It applies mean pooling over token embeddings to produce sentence vectors, and achieves an average score of 63.08 across Chinese text-matching benchmarks including ATEC, BQ, LCQMC, and PAWSX.
- Trained using contrastive learning with cosine similarity objectives
- Optimized for both sentence-to-sentence and sentence-to-paragraph matching
- Supports efficient processing with 3066 QPS (Queries Per Second)
- Enhanced with s2p (sentence to paraphrase) data for improved long text representation
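The mean-pooling step mentioned above averages the encoder's per-token vectors into one sentence vector, skipping padding positions. A minimal numpy sketch (toy shapes stand in for the real sequence length and 768-dimensional hidden size):

```python
import numpy as np

def mean_pooling(token_embeddings, attention_mask):
    """Average token embeddings into a sentence vector, ignoring padding.

    token_embeddings: (seq_len, dim) array of per-token vectors.
    attention_mask:   (seq_len,) array of 1s (real tokens) and 0s (padding).
    """
    mask = attention_mask[:, None].astype(float)        # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)      # sum real tokens only
    count = max(mask.sum(), 1e-9)                       # avoid divide-by-zero
    return summed / count

# Toy example: 3 real tokens + 1 padding token, dim=2
tokens = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]])
mask = np.array([1, 1, 1, 0])
print(mean_pooling(tokens, mask))  # → [3. 4.] (padding row is excluded)
```

Masking before averaging matters: including padding rows would skew every sentence vector toward the padding embedding.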
## Core Capabilities
- Semantic sentence embedding generation
- Text similarity computation
- Paraphrase detection
- Information retrieval tasks
- Text clustering applications
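For the retrieval and search capabilities listed above, the usual pattern is to embed a query and a set of documents, then rank documents by cosine similarity to the query. A small numpy sketch with illustrative vectors (real usage would substitute the model's 768-dimensional embeddings):

```python
import numpy as np

def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices sorted by descending cosine similarity to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    order = np.argsort(-scores)         # best match first
    return order, scores[order]

query = np.array([1.0, 0.0])
docs = np.array([
    [0.0, 1.0],   # orthogonal to the query
    [1.0, 0.0],   # identical direction
    [0.7, 0.7],   # partially similar
])
order, scores = rank_by_similarity(query, docs)
print(order)  # → [1 2 0]
```

The same ranking loop underlies clustering and paraphrase detection: both reduce to comparing embedding vectors with cosine similarity.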
## Frequently Asked Questions
**Q: What makes this model unique?**
This model stands out for its performance on Chinese text-matching tasks, achieving SOTA results through training on both sentence-level and paragraph-level paraphrase data. It is specifically optimized for practical applications that require semantic understanding of Chinese text.
**Q: What are the recommended use cases?**
The model is ideal for Chinese language applications requiring semantic similarity matching, particularly in scenarios involving sentence-to-paragraph comparisons, document retrieval, and semantic search. It's especially effective for tasks requiring understanding of longer text passages.