# LaBSE-en-ru

| Property | Value |
|---|---|
| Parameter Count | 129M |
| Author | cointegrated |
| Paper | Language-agnostic BERT Sentence Embedding |
| Model Type | Bilingual BERT |
| Languages | English, Russian |
## What is LaBSE-en-ru?

LaBSE-en-ru is a bilingual version of Google's Language-agnostic BERT Sentence Embedding (LaBSE) model, optimized specifically for English and Russian. It is a significant size optimization: the model is only about 27% of the size of the original, while the quality of the embeddings for these two languages is preserved.
## Implementation Details

The model uses the BERT architecture with a vocabulary truncated to English and Russian tokens only, a roughly 90% reduction in vocabulary size. With 129M parameters, it generates sentence embeddings efficiently using PyTorch and the Transformers library (see the sketch after the list below).
- Optimized vocabulary focused on English and Russian tokens
- Supports sentence similarity tasks
- Implements efficient embedding generation
- Uses normalized pooler output for representations
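A minimal sketch of embedding generation with Transformers, assuming the model is published on the Hugging Face Hub as `cointegrated/LaBSE-en-ru` (author and model name taken from the table above):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the bilingual model and its truncated tokenizer
tokenizer = AutoTokenizer.from_pretrained("cointegrated/LaBSE-en-ru")
model = AutoModel.from_pretrained("cointegrated/LaBSE-en-ru")

sentences = ["Hello, world!", "Привет, мир!"]

# Tokenize with padding, truncating at the model's 64-token limit
inputs = tokenizer(sentences, padding=True, truncation=True,
                   max_length=64, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# L2-normalize the pooler output to get the sentence representations
embeddings = torch.nn.functional.normalize(outputs.pooler_output, dim=1)
print(embeddings.shape)  # torch.Size([2, 768])
```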
## Core Capabilities
- Bilingual sentence embedding generation
- Cross-lingual sentence similarity comparison
- Efficient processing with reduced model size
- Maximum sequence length of 64 tokens
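Because the embeddings are L2-normalized, cross-lingual similarity reduces to a dot product. A short example, reusing `tokenizer` and `model` from the sketch above:

```python
def embed(texts):
    # Encode a batch of sentences into normalized embeddings
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=64, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return torch.nn.functional.normalize(out.pooler_output, dim=1)

english = ["The cat sits on the mat.", "I love machine learning."]
russian = ["Кошка сидит на коврике.", "Я люблю машинное обучение."]

# Rows: English sentences; columns: Russian sentences
similarity = embed(english) @ embed(russian).T
print(similarity)  # translation pairs score highest on the diagonal
```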
## Frequently Asked Questions
Q: What makes this model unique?
Its uniqueness lies in efficient bilingual optimization: it preserves the embedding quality of the original LaBSE for English and Russian while shrinking the model to roughly a quarter of the original size, with a vocabulary focused on those two languages.
Q: What are the recommended use cases?
The model is ideal for cross-lingual sentence similarity tasks between English and Russian, document alignment, and bilingual text processing applications where efficient computation is required.
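As an illustration of document alignment, here is a hypothetical greedy aligner built on the `embed` helper above, matching each English sentence to its most similar Russian sentence. This is a sketch under simplifying assumptions, not a production aligner; real pipelines typically add similarity thresholds or one-to-one matching constraints:

```python
def align(en_sentences, ru_sentences):
    # Greedy alignment: pair each English sentence with its nearest
    # Russian sentence by cosine similarity (dot product of unit vectors)
    sim = embed(en_sentences) @ embed(ru_sentences).T
    best = sim.argmax(dim=1).tolist()
    return [(en_sentences[i], ru_sentences[j]) for i, j in enumerate(best)]

pairs = align(
    ["Good morning.", "See you tomorrow."],
    ["До завтра.", "Доброе утро."],
)
print(pairs)  # each English sentence paired with its Russian counterpart
```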