USER-base

Maintained by: deepvk

Property              Value
Model Size            85M parameters
Embedding Dimension   768
Base Architecture     DeBERTa-v1-base
Hugging Face          deepvk/USER-base

What is USER-base?

USER-base (Universal Sentence Encoder for Russian) is a sentence-transformer model built specifically for the Russian language. It maps Russian text to 768-dimensional dense vectors that can be used for semantic search, clustering, and other embedding-based NLP tasks. The model builds on deepvk/deberta-v1-base and is trained on Russian-language data.

Implementation Details

The training recipe is inspired by bge-base-en, with adaptations for Russian. It has two stages: contrastive pre-training with weak supervision on the Russian mMarco corpus, followed by supervised fine-tuning on both symmetric and asymmetric data. The LM-Cocktail technique is then used to combine the different training objectives into a single model.

  • Trained on over 3.3M positive pairs and 792K negative pairs
  • Uses distinct prefixes for query and passage embeddings
  • Outperforms other base-sized models on Encodechka and MTEB benchmarks
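
LM-Cocktail is generally described as merging models by taking a weighted average of their parameters. The sketch below illustrates that idea in pure Python on toy data. The `merge_models` helper, the parameter names, and the equal weights are all hypothetical and chosen for illustration; they are not the authors' actual merging procedure or weights.

```python
from typing import Dict, List, Sequence

# A toy stand-in for a model checkpoint: parameter name -> flat list of values.
StateDict = Dict[str, List[float]]


def merge_models(state_dicts: Sequence[StateDict], weights: Sequence[float]) -> StateDict:
    """Return the weighted average of several checkpoints with identical shapes."""
    if not state_dicts or len(state_dicts) != len(weights):
        raise ValueError("need at least one model and exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    normalized = [w / total for w in weights]  # make the weights sum to 1

    merged: StateDict = {}
    for name, reference in state_dicts[0].items():
        merged[name] = [
            sum(w * sd[name][i] for w, sd in zip(normalized, state_dicts))
            for i in range(len(reference))
        ]
    return merged


# Hypothetical checkpoints: one fine-tuned on symmetric data, one on asymmetric data.
symmetric = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
asymmetric = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

merged = merge_models([symmetric, asymmetric], weights=[0.5, 0.5])
```

With real checkpoints the same averaging would be applied tensor by tensor, and the weights would typically be tuned rather than fixed at equal values.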

Core Capabilities

  • High-quality Russian text embeddings for semantic similarity tasks
  • Efficient information retrieval and passage matching
  • Clustering and semantic search optimization
  • Competitive performance on Russian NLP benchmarks

Frequently Asked Questions

Q: What makes this model unique?

USER-base is optimized specifically for Russian while staying relatively compact at 85M parameters. It scores 0.772 on Encodechka and 0.666 on MTEB, outperforming other models of similar size on both benchmarks.

Q: What are the recommended use cases?

The choice of prefix depends on the task:

  • Asymmetric tasks such as passage retrieval and QA: use the "query:" prefix for queries and the "passage:" prefix for passages
  • Symmetric tasks such as semantic similarity and paraphrase detection: use the "query:" prefix for all inputs
  • Embeddings used as features for classification or clustering: use the "query:" prefix as well
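
The prefix rules above might be applied as in the following sketch. It assumes the model loads through the `sentence-transformers` library and that each prefix is followed by a single space (`"query: "`, `"passage: "`); both are assumptions not stated here. The helpers `with_prefix`, `cosine`, `rank`, and `encode` are hypothetical names introduced for illustration.

```python
import math
from typing import List, Sequence

# Assumption: prefixes are followed by one space before the text.
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def with_prefix(texts: Sequence[str], prefix: str) -> List[str]:
    """Prepend the task prefix to every input text."""
    return [prefix + text.strip() for text in texts]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_vec: Sequence[float], passage_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Return passage indices ordered from most to least similar to the query."""
    scores = [cosine(query_vec, p) for p in passage_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def encode(texts: Sequence[str], prefix: str, model_name: str = "deepvk/USER-base"):
    """Embed prefixed texts; imports lazily because it downloads the model."""
    from sentence_transformers import SentenceTransformer  # assumed dependency

    model = SentenceTransformer(model_name)
    return model.encode(with_prefix(texts, prefix)).tolist()
```

For retrieval, queries would go through `encode(queries, QUERY_PREFIX)` and documents through `encode(passages, PASSAGE_PREFIX)`, after which `rank` orders the passages for each query. For similarity, classification, or clustering, every text would be encoded with `QUERY_PREFIX`.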
