text2vec-base-chinese

Maintained by: shibing624

  • Parameter Count: 102M
  • License: Apache 2.0
  • Author: shibing624
  • Base Model: hfl/chinese-macbert-base

What is text2vec-base-chinese?

text2vec-base-chinese is a Chinese sentence embedding model that maps sentences to a 768-dimensional dense vector space. Trained with the CoSENT (Cosine Sentence) approach, it is designed for Chinese text processing tasks including semantic similarity, text matching, and information retrieval.
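The usual entry point is the sentence-transformers library, as in the minimal sketch below; the package choice and the two example sentences are illustrative, while the model ID is the one listed above.

```python
# Minimal sketch: encoding Chinese sentences into 768-dimensional vectors.
# Assumes the sentence-transformers package is installed (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("shibing624/text2vec-base-chinese")

sentences = ["如何更换花呗绑定银行卡", "花呗更改绑定银行卡"]  # illustrative example pair
embeddings = model.encode(sentences)

print(embeddings.shape)  # expected: (2, 768)
```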

Implementation Details

The model is built on hfl/chinese-macbert-base and fine-tuned with a contrastive objective based on cosine similarity between sentence pairs. Input sequences are truncated to 128 tokens, and sentence embeddings are obtained by mean pooling over token embeddings (a sketch of this pooling step follows the list below). On Chinese semantic similarity benchmarks, the model reports Spearman correlation scores of 31.93 on ATEC, 42.67 on BQ, and 79.30 on STS-B.

  • Architecture: CoSENT with Transformer base and mean pooling
  • Input Processing: Supports sequences up to 128 tokens
  • Output: 768-dimensional dense vectors
  • Training Data: Fine-tuned on the shibing624/nli_zh dataset
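A minimal sketch of that mean-pooling step, using the plain Hugging Face transformers API rather than sentence-transformers; the helper function name is illustrative, and the 128-token truncation mirrors the limit noted above.

```python
# Sketch of attention-mask-weighted mean pooling over token embeddings.
# Assumes transformers and torch are installed; max_length follows the 128-token limit noted above.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("shibing624/text2vec-base-chinese")
model = AutoModel.from_pretrained("shibing624/text2vec-base-chinese")

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padded positions via the attention mask.
    token_embeddings = model_output[0]  # (batch, seq_len, 768)
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, dim=1) / torch.clamp(mask.sum(dim=1), min=1e-9)

sentences = ["如何更换花呗绑定银行卡", "花呗更改绑定银行卡"]
encoded = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded)

embeddings = mean_pooling(output, encoded["attention_mask"])
print(embeddings.shape)  # (2, 768)
```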

Core Capabilities

  • Semantic sentence embedding generation
  • Text similarity computation (see the cosine-similarity sketch after this list)
  • Information retrieval
  • Supports multiple acceleration backends (ONNX, OpenVINO)
  • Efficient CPU and GPU inference options
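The similarity and backend capabilities above can be exercised with sentence-transformers utilities, as in the sketch below; util.cos_sim is part of that library's public API, the example sentences are illustrative, and the ONNX/OpenVINO note applies only to newer library versions.

```python
# Sketch: cosine-similarity text matching between two Chinese sentences.
# Assumes sentence-transformers is installed; example sentences are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("shibing624/text2vec-base-chinese")
# Recent sentence-transformers releases (>= 3.2) also accept backend="onnx" or
# backend="openvino" in the constructor for accelerated CPU inference.

emb = model.encode(["如何更换花呗绑定银行卡", "花呗更改绑定银行卡"], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1])
print(float(score))  # closer to 1.0 means higher semantic similarity
```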

Frequently Asked Questions

Q: What makes this model unique?

The model combines the CoSENT training objective with a Chinese-MacBERT backbone pre-trained on large Chinese corpora, giving strong performance on Chinese semantic similarity benchmarks while maintaining fast inference (a reported throughput of roughly 3,008 QPS in the text2vec benchmark).

Q: What are the recommended use cases?

The model excels in Chinese sentence similarity tasks, semantic search, and text matching applications. It's particularly well-suited for applications requiring semantic understanding of Chinese text at the sentence level.
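As a concrete illustration of the semantic-search use case, the sketch below ranks a tiny in-memory corpus against a query; the corpus, query, and top_k value are illustrative, and util.semantic_search comes from sentence-transformers.

```python
# Sketch: small-scale semantic search over an in-memory Chinese corpus.
# Assumes sentence-transformers is installed; corpus and query strings are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("shibing624/text2vec-base-chinese")

corpus = ["今天天气很好", "我想更换绑定的银行卡", "北京是中国的首都"]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("怎么改绑银行卡", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]

for hit in hits:
    # Each hit carries the corpus index and its cosine-similarity score.
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```

For larger corpora, the same embeddings can be indexed in an approximate-nearest-neighbor store; the brute-force search shown here is only meant to demonstrate the retrieval pattern.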
