# bert-base-bg-cs-pl-ru-cased
| Property | Value |
|---|---|
| Parameters | 180M |
| Architecture | 12 layers, 768 hidden units, 12 attention heads |
| Paper | ACL Anthology W19-3712 |
| Author | DeepPavlov |
## What is bert-base-bg-cs-pl-ru-cased?
SlavicBERT is a BERT model specialized for four Slavic languages: Bulgarian, Czech, Polish, and Russian. It was initialized from multilingual BERT and further trained on Russian news data and on Wikipedia text in all four languages. The model is case-sensitive and uses a custom subtoken vocabulary built from its training data.
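To make the "case-sensitive, custom subtoken vocabulary" point concrete, here is a toy sketch of the greedy longest-match-first WordPiece splitting that BERT-style tokenizers use. The vocabulary below is invented for illustration only; the real model ships its own subtoken vocabulary built from its training data.

```python
# Toy WordPiece-style subtoken splitter. TOY_VOCAB is a made-up vocabulary
# for demonstration, not the model's real one.
TOY_VOCAB = {"Пра", "##га", "пра", "##г", "Warszawa", "[UNK]"}

def wordpiece(word, vocab=TOY_VOCAB):
    """Greedily split one word into the longest matching subtokens."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation pieces carry the ## prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no subtoken match anywhere in the word
        pieces.append(piece)
        start = end
    return pieces

# Case matters: "Прага" (Prague) and "прага" start with different pieces.
print(wordpiece("Прага"))  # ['Пра', '##га']
print(wordpiece("прага"))  # ['пра', '##га']
```

Because the vocabulary is built from cased Slavic text, capitalized and lowercased forms of the same word can map to different subtoken sequences, which is exactly what a cased model exploits.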
## Implementation Details
The model follows the BERT-base architecture with customizations for Slavic languages: 12 transformer layers, a hidden size of 768, and 12 attention heads, for a total of roughly 180M parameters. Since November 2021, the released checkpoint includes both the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) heads.
- Specialized vocabulary for Slavic languages
- Trained on multiple high-quality sources including news and Wikipedia
- Case-sensitive processing
- Built upon multilingual BERT architecture
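A quick back-of-the-envelope calculation shows how the architecture figures above combine into the quoted parameter count. The vocabulary size used here is an assumption for illustration (the model's actual custom vocabulary size is not stated in this card); everything else follows the standard BERT-base configuration.

```python
# Rough parameter count for a BERT-base encoder (12 layers, hidden 768,
# FFN 3072, 512 positions). VOCAB = 120_000 is an assumed vocabulary size.
HIDDEN, LAYERS, FFN, MAX_POS, VOCAB = 768, 12, 3072, 512, 120_000

# Embeddings: token + position + segment tables, plus one LayerNorm.
embeddings = (VOCAB + MAX_POS + 2) * HIDDEN + 2 * HIDDEN

# Per layer: Q/K/V/output projections, two FFN matrices, two LayerNorms.
attention = 4 * (HIDDEN * HIDDEN + HIDDEN)
ffn = (HIDDEN * FFN + FFN) + (FFN * HIDDEN + HIDDEN)
layer = attention + ffn + 2 * (2 * HIDDEN)

total = embeddings + LAYERS * layer
print(f"{total / 1e6:.0f}M parameters")  # lands close to the quoted 180M
```

Most of the budget sits in the embedding table, which is why a large multilingual subtoken vocabulary pushes a BERT-base model toward 180M parameters rather than the ~110M of English BERT-base.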
## Core Capabilities
- Native support for Bulgarian, Czech, Polish, and Russian
- Optimized for Named Entity Recognition tasks
- Efficient cross-lingual transfer learning
- Contextual embeddings tailored to Slavic languages
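Since Named Entity Recognition is the headline use case, here is a minimal sketch of the post-processing step a NER pipeline on top of this model typically needs: turning token-level BIO tags back into entity spans. The tokens and tags below are invented for illustration.

```python
# Decode BIO-tagged tokens into (entity_type, text) spans.
def bio_to_spans(tokens, tags):
    """Collect entity spans from parallel lists of tokens and BIO tags."""
    spans, current, ctype = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity begins
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            current.append(token)         # continue the open entity
        else:                             # "O" or inconsistent I- closes it
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [], None
    if current:
        spans.append((ctype, " ".join(current)))
    return spans

tokens = ["Adam", "Mickiewicz", "urodził", "się", "w", "Zaosiu"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))
# [('PER', 'Adam Mickiewicz'), ('LOC', 'Zaosiu')]
```

The same decoding works unchanged across all four supported languages, since only the tag scheme matters at this stage.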
## Frequently Asked Questions
**Q: What makes this model unique?**
This model is specifically optimized for Slavic languages, offering better performance than general multilingual models for Bulgarian, Czech, Polish, and Russian language tasks. Its specialized training on news and Wikipedia content makes it particularly effective for real-world applications.
**Q: What are the recommended use cases?**
The model is particularly well-suited for Named Entity Recognition tasks in Slavic languages, as well as general natural language understanding tasks like text classification, sentiment analysis, and sequence labeling in these languages. It's especially valuable for applications requiring cross-lingual transfer between Slavic languages.