# USER-base
| Property | Value |
|---|---|
| Model Size | 85M parameters |
| Embedding Dimension | 768 |
| Base Architecture | DeBERTa-v1-base |
| Hugging Face | deepvk/USER-base |
## What is USER-base?
USER-base (Universal Sentence Encoder for Russian) is a sentence transformer model built exclusively for the Russian language. It maps Russian text to 768-dimensional dense vector representations, making it well suited to semantic search, clustering, and other NLP tasks. The model builds on deepvk/deberta-v1-base and was trained extensively on Russian-language data.
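As a quick illustration, here is a minimal sketch of loading the model through the sentence-transformers library and encoding a pair of sentences; the `"query: "` prefix follows the convention described in the FAQ below, and the example sentences are invented.

```python
from sentence_transformers import SentenceTransformer

# Load USER-base from the Hugging Face Hub.
model = SentenceTransformer("deepvk/USER-base")

# Russian inputs; "query: " is the prefix convention for symmetric tasks.
sentences = [
    "query: Москва — столица России.",   # "Moscow is the capital of Russia."
    "query: Столица России — Москва.",   # "The capital of Russia is Moscow."
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 768) — one 768-dimensional vector per sentence
```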
## Implementation Details
The training approach is inspired by bge-base-en, with Russian-specific adaptations. Training proceeded in two stages: contrastive pre-training with weak supervision on the Russian portion of the mMARCO corpus, followed by supervised fine-tuning on both symmetric and asymmetric data. Notably, the final model was produced by merging checkpoints trained on different objectives using the LM-Cocktail technique.
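For intuition, the snippet below sketches a generic in-batch contrastive (InfoNCE-style) objective of the kind commonly used for such pre-training. The actual loss, temperature, and negative-sampling scheme used for USER-base are not documented here, so treat this purely as an illustration.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    # Cosine-similarity matrix between every query and every passage in the
    # batch; matching (query, passage) pairs lie on the diagonal.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = (q @ p.T) / temperature
    # Each query's positive is its own passage; all other passages in the
    # batch serve as in-batch negatives.
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```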
- Trained on over 3.3M positive pairs and 792K negative pairs
- Distinguishes queries from passages via dedicated "query:" and "passage:" prefixes (see the retrieval sketch after this list)
- Outperforms other base-sized models on the Encodechka and MTEB benchmarks
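A minimal retrieval sketch using these prefixes; `util.cos_sim` is a standard sentence-transformers helper, and the example sentences are invented.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("deepvk/USER-base")

# Asymmetric retrieval: "query: " for the question, "passage: " for documents.
query_emb = model.encode(
    "query: Какой город является столицей России?",  # "Which city is the capital of Russia?"
    convert_to_tensor=True,
)
passage_embs = model.encode(
    [
        "passage: Москва — столица России.",        # on-topic
        "passage: Пингвины живут в Антарктиде.",    # off-topic
    ],
    convert_to_tensor=True,
)

scores = util.cos_sim(query_emb, passage_embs)
print(scores)  # the on-topic passage should receive the higher score
```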
## Core Capabilities
- High-quality Russian text embeddings for semantic similarity tasks
- Efficient information retrieval and passage matching
- Clustering and semantic search (a clustering sketch follows this list)
- Competitive performance on Russian NLP benchmarks
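As a sketch of the clustering use case, the snippet below groups four invented headlines with scikit-learn's KMeans. Applying the `"query: "` prefix for clustering is an assumption carried over from the symmetric-task convention.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("deepvk/USER-base")

texts = [
    "query: Курс доллара снова вырос.",     # "The dollar rate rose again."
    "query: Рубль укрепился к евро.",       # "The ruble strengthened against the euro."
    "query: Сборная выиграла матч.",        # "The national team won the match."
    "query: Футболисты забили три гола.",   # "The players scored three goals."
]
embeddings = model.encode(texts)

# Expect two topical clusters: finance (first two) and sports (last two).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)
```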
## Frequently Asked Questions
### Q: What makes this model unique?
USER-base is optimized specifically for Russian-language processing, offering state-of-the-art performance at a relatively compact 85M parameters. It achieves strong results on both the Encodechka (0.772) and MTEB (0.666) benchmarks, outperforming other models of similar size.
### Q: What are the recommended use cases?
The model excels in several scenarios:
- Asymmetric tasks such as passage retrieval and QA: prefix queries with "query:" and documents with "passage:".
- Symmetric tasks such as semantic similarity and paraphrase detection: prefix both texts with "query:" (see the sketch below).
- Embedding-based features for classification or clustering tasks.
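A minimal sketch of the symmetric case, with invented example sentences:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("deepvk/USER-base")

# Symmetric task: the same "query: " prefix on both sides.
a = model.encode("query: Сегодня отличная погода.",    # "The weather is great today."
                 convert_to_tensor=True)
b = model.encode("query: Погода сегодня прекрасная.",  # "Today's weather is wonderful."
                 convert_to_tensor=True)

print(util.cos_sim(a, b).item())  # close to 1.0 for paraphrases
```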