ru-en-RoSBERTa
| Property | Value |
|---|---|
| Parameter Count | 404M |
| Base Model | ruRoBERTa-large |
| License | MIT |
| Paper | arXiv:2408.12503 |
| Languages | Russian, English |
What is ru-en-RoSBERTa?
ru-en-RoSBERTa is a text embedding model designed primarily for Russian, with additional English support. Built on the ruRoBERTa-large architecture, it was fine-tuned on approximately 4 million text pairs drawn from supervised, synthetic, and unsupervised sources in both Russian and English. The model relies on task-specific prefixes, which makes a single set of weights applicable across a range of NLP tasks.
Implementation Details
CLS pooling is the recommended pooling strategy. Three prefix types cover the main use cases: "search_query"/"search_document" for retrieval tasks, "classification" for symmetric paraphrase-style tasks, and "clustering" for thematic grouping. The maximum input length is 512 tokens, and the tokenizer includes English tokens from the original RoBERTa tokenizer. A usage sketch follows the list below.
- Supports both Transformers and SentenceTransformers implementations
- Supports L2-normalized embedding output for cosine-similarity comparisons
- Features task-specific prefixes for optimal performance
- Implements CLS and mean pooling options
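The snippet below is a minimal retrieval sketch using the SentenceTransformers path. It assumes the model is published on the Hugging Face Hub as ai-forever/ru-en-RoSBERTa and that prefixes are prepended to the raw text as "prefix: " (colon plus space); verify both against the model card.

```python
from sentence_transformers import SentenceTransformer

# Hub id assumed to be ai-forever/ru-en-RoSBERTa; adjust if it differs.
model = SentenceTransformer("ai-forever/ru-en-RoSBERTa")

# Asymmetric retrieval: queries and documents get different prefixes.
# The "prefix: " format (colon plus space) is an assumption here.
query = "search_query: Как научить попугая говорить?"
documents = [
    "search_document: Начинайте дрессировку попугая с простых слов.",
    "search_document: Москва основана в 1147 году.",
]

# normalize_embeddings=True yields unit-length vectors, so the dot
# product below equals cosine similarity.
query_emb = model.encode(query, normalize_embeddings=True)
doc_embs = model.encode(documents, normalize_embeddings=True)

scores = doc_embs @ query_emb
print(scores)  # higher score = more relevant document
```

The same pattern applies to the "classification" and "clustering" prefixes; only the prefix string changes.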
Core Capabilities
- Bilingual text embedding generation
- Answer and relevant paragraph retrieval
- Semantic textual similarity assessment
- Topic classification and clustering
- Cross-lingual text processing
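To illustrate the similarity and cross-lingual capabilities via the plain Transformers path, the sketch below applies CLS pooling and L2 normalization by hand, under the same hub-id and prefix-format assumptions as the previous example.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ai-forever/ru-en-RoSBERTa")
model = AutoModel.from_pretrained("ai-forever/ru-en-RoSBERTa")

# A Russian/English paraphrase pair with the symmetric "classification" prefix.
texts = [
    "classification: Сегодня отличная погода.",
    "classification: The weather is great today.",
]

# Respect the model's 512-token input limit.
batch = tokenizer(
    texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
)

with torch.no_grad():
    outputs = model(**batch)

# CLS pooling: take the first token's hidden state, then L2-normalize.
embeddings = F.normalize(outputs.last_hidden_state[:, 0], dim=-1)

# Dot product of unit vectors = cosine similarity.
print(float(embeddings[0] @ embeddings[1]))
```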
Frequently Asked Questions
Q: What makes this model unique?
Its distinctive feature is the prefix-based approach, which lets one underlying model handle retrieval, classification, and clustering tasks, combined with bilingual Russian-English coverage and fine-tuning on roughly 4 million diverse text pairs.
Q: What are the recommended use cases?
The model performs well across semantic search, paraphrase detection, text classification, and clustering. It is strongest for Russian, while remaining capable in English and cross-lingual settings.