Conan-embedding-v1

Property	Value
Parameters	326M
License	CC-BY-NC 4.0
Architecture	BERT-based
Paper	arXiv:2408.15710
Author	TencentBAC

What is Conan-embedding-v1?

Conan-embedding-v1 is a Chinese text embedding model developed by Tencent BAC Group. It reports an average score of 72.62 on the C-MTEB benchmark and is presented as outperforming competing models on tasks such as classification, clustering, and retrieval. Its main distinguishing technique is enhanced negative sampling during training, intended to produce more discriminative text embeddings.

Implementation Details

The model is implemented using PyTorch and follows the BERT architecture, optimized for generating text embeddings. It uses F32 tensor types and leverages Safetensors for model storage.

Reported benchmark scores by task category (0 to 100 scale; the underlying metric differs by task, so not every figure is an accuracy):
Classification: 75.03
Clustering: 66.33
Reranking: 72.76
Retrieval: 76.67
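
As a rough sizing aid, the 326M parameter count from the property table and the F32 tensor type together imply the memory needed just to hold the weights. The sketch below does that arithmetic. The helper name `weight_bytes` is illustrative, and the F16 row is a hypothetical comparison, since this page only lists F32 weights.

```python
# Rough weight-memory estimate from parameter count and tensor dtype.
# Covers weights only: no activations, tokenizer, or framework overhead.
BYTES_PER_PARAM = {"F32": 4, "F16": 2}  # F16 is a hypothetical comparison point


def weight_bytes(num_params: int, dtype: str = "F32") -> int:
    """Return the number of bytes needed to store the weights alone."""
    return num_params * BYTES_PER_PARAM[dtype]


PARAMS = 326_000_000  # "326M" from the property table, a rounded figure

for dtype in ("F32", "F16"):
    size = weight_bytes(PARAMS, dtype)
    print(f"{dtype}: {size / 1e9:.2f} GB ({size / 2**30:.2f} GiB)")
```

In F32 this comes to roughly 1.3 GB for the weights, which is a lower bound on what inference will actually use.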

Core Capabilities

Robust performance across Chinese NLP tasks
Specialized negative sampling methodology
Strong sentence embedding capabilities for Chinese text
Efficient retrieval and reranking capabilities

Frequently Asked Questions

Q: What makes this model unique?

The model's distinctive feature is its enhanced negative sampling approach, that is, the way it selects non-matching examples during training, which is credited for its results across Chinese NLP tasks. It performs consistently on sentence embedding benchmarks covering classification, clustering, and retrieval.
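
To make the role of negative samples concrete, here is a minimal sketch of a generic contrastive (InfoNCE-style) loss with explicit negatives. It is not the model's actual training code or the paper's exact loss; the function names, the temperature of 0.05, and the two-dimensional toy vectors are all assumptions chosen for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(query, positive, negatives, temperature=0.05):
    """Cross-entropy of picking the positive among positive + negatives.

    Similarities are cosine similarities scaled by the temperature.
    """
    q = normalize(query)
    candidates = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [dot(q, c) / temperature for c in candidates]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]  # index 0 is the positive


# Toy vectors (illustrative only, not real embeddings).
query = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [-1.0, 0.0]  # points away from the query
hard_negative = [0.8, 0.3]   # almost as close to the query as the positive

easy_loss = info_nce_loss(query, positive, [easy_negative])
hard_loss = info_nce_loss(query, positive, [hard_negative])
print(f"loss with easy negative: {easy_loss:.4f}")
print(f"loss with hard negative: {hard_loss:.4f}")
```

The easy negative contributes almost nothing to the loss, while the hard negative produces a clearly larger one. That is the general motivation for choosing negatives carefully: hard negatives give the model a stronger training signal.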

Q: What are the recommended use cases?

The model is well suited to Chinese text processing tasks, including semantic similarity comparison, document clustering, information retrieval, and text classification, and more generally to applications that need high-quality sentence embeddings for Chinese text.
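
The semantic similarity and retrieval use cases above typically reduce to comparing embedding vectors by cosine similarity. The sketch below ranks documents against a query using toy three-dimensional vectors as stand-ins for real embeddings. The `embed` helper is hypothetical and is not called here: it assumes the third-party `sentence-transformers` package is installed and that the model is published under the ID `TencentBAC/Conan-embedding-v1`, neither of which this page confirms.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def embed(sentences, model_id="TencentBAC/Conan-embedding-v1"):
    """Hypothetical helper, not exercised below.

    Assumes sentence-transformers is installed and that model_id is correct.
    """
    from sentence_transformers import SentenceTransformer  # lazy third-party import

    model = SentenceTransformer(model_id)
    return model.encode(sentences, normalize_embeddings=True).tolist()


# Toy stand-ins for embeddings; real vectors would come from embed(docs).
docs = ["今天天气很好", "股票市场下跌", "明天会下雨"]
doc_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.7, 0.6, 0.1]]
query_vec = [1.0, 0.2, 0.0]  # stand-in for a weather-related query

ranking = rank_documents(query_vec, doc_vecs)
for i, score in ranking:
    print(f"{score:.3f}  {docs[i]}")
```

With these toy vectors the two weather-related sentences rank above the stock-market one. The same `rank_documents` step applies unchanged once the vectors come from a real embedding model.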