LaBSE-en-ru

Maintained By
cointegrated

Parameter Count: 129M
Author: cointegrated
Paper: Language-agnostic BERT Sentence Embedding
Model Type: Bilingual BERT
Languages: English, Russian

What is LaBSE-en-ru?

LaBSE-en-ru is a specialized bilingual version of Google's Language-agnostic BERT Sentence Embedding (LaBSE) model, optimized for English and Russian. It is a significant size optimization, shrinking the model to just 27% of the original while maintaining embedding quality for these two languages.

Implementation Details

The model utilizes the BERT architecture and has been carefully truncated to retain only English and Russian tokens in its vocabulary, resulting in a 90% reduction in vocabulary size. With 129M parameters, it offers efficient sentence embedding generation using PyTorch and the Transformers library.

  • Optimized vocabulary focused on English and Russian tokens
  • Supports sentence similarity tasks
  • Implements efficient embedding generation
  • Uses normalized pooler output for representations
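The points above can be sketched with PyTorch and the Transformers library. This is a minimal sketch, not official usage: the repository id `cointegrated/LaBSE-en-ru` is assumed from the author and model names, and the normalized pooler output is taken as the sentence representation as described above.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Repository id assumed from the author and model name on the card
tokenizer = AutoTokenizer.from_pretrained("cointegrated/LaBSE-en-ru")
model = AutoModel.from_pretrained("cointegrated/LaBSE-en-ru")
model.eval()

sentences = ["Hello, world!", "Привет, мир!"]
# Truncate to the model's maximum sequence length of 64 tokens
encoded = tokenizer(sentences, padding=True, truncation=True,
                    max_length=64, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded)
# L2-normalize the pooler output to obtain sentence embeddings
embeddings = F.normalize(output.pooler_output)
```

Because the embeddings are normalized, the cosine similarity of any two sentences is simply the dot product of their embedding vectors.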

Core Capabilities

  • Bilingual sentence embedding generation
  • Cross-lingual sentence similarity comparison
  • Efficient processing with reduced model size
  • Maximum sequence length of 64 tokens

Frequently Asked Questions

Q: What makes this model unique?

Its uniqueness lies in efficient bilingual optimization: it delivers the same embedding quality for English and Russian as the original LaBSE at roughly a quarter of the size, thanks to a vocabulary focused on those two languages.

Q: What are the recommended use cases?

The model is ideal for cross-lingual sentence similarity tasks between English and Russian, document alignment, and bilingual text processing applications where efficient computation is required.
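For document alignment, pairwise cosine similarity between L2-normalized embeddings reduces to a single matrix product. A minimal sketch, using toy vectors standing in for real LaBSE-en-ru embeddings:

```python
import torch
import torch.nn.functional as F

def align(en_embeddings, ru_embeddings):
    """Match each English sentence to its most similar Russian sentence.

    Assumes both inputs are L2-normalized, so cosine similarity
    is a plain matrix product.
    """
    similarity = en_embeddings @ ru_embeddings.T
    return similarity.argmax(dim=1)

# Toy normalized vectors standing in for sentence embeddings
en = F.normalize(torch.tensor([[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]]))
ru = F.normalize(torch.tensor([[0.1, 1.0, 0.1], [1.0, 0.0, 0.1]]))
matches = align(en, ru)  # en[0] -> ru[1], en[1] -> ru[0]
```

With real embeddings, the same matrix product scales to aligning whole documents sentence by sentence.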
