# bert-base-bg-cs-pl-ru-cased
| Property | Value |
|---|---|
| Parameters | 180M |
| Architecture | 12 layers, 768 hidden units, 12 attention heads |
| Paper | ACL Anthology W19-3712 |
| Author | DeepPavlov |
## What is bert-base-bg-cs-pl-ru-cased?
SlavicBERT is a BERT model specialized for four Slavic languages: Bulgarian, Czech, Polish, and Russian. It was initialized from multilingual BERT and further trained on Russian news data and on Wikipedia text in all four languages. The model is case-sensitive and uses a custom subtoken vocabulary built from its training data.
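To make the "case-sensitive, custom subtoken vocabulary" point concrete, here is a toy sketch of the greedy longest-match-first WordPiece splitting that BERT-style tokenizers use. The vocabulary below is invented for illustration only; the real model ships its own subtoken vocabulary built from its training data.

```python
# Toy WordPiece-style subtoken splitter. TOY_VOCAB is a made-up vocabulary
# for demonstration, not the model's real one.
TOY_VOCAB = {"Пра", "##га", "пра", "##г", "Warszawa", "[UNK]"}

def wordpiece(word, vocab=TOY_VOCAB):
    """Greedily split one word into the longest matching subtokens."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation pieces carry the ## prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no subtoken match anywhere in the word
        pieces.append(piece)
        start = end
    return pieces

# Case matters: "Прага" (Prague) and "прага" start with different pieces.
print(wordpiece("Прага"))  # ['Пра', '##га']
print(wordpiece("прага"))  # ['пра', '##га']
```

Because the vocabulary is built from cased Slavic text, capitalized and lowercased forms of the same word can map to different subtoken sequences, which is exactly what a cased model exploits.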
## Implementation Details
The model follows the BERT-base architecture with customizations for Slavic languages: 12 transformer layers, a hidden size of 768, and 12 attention heads, for a total of roughly 180M parameters. Since November 2021, the released checkpoint includes both the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) heads.
- Specialized vocabulary for Slavic languages
- Trained on multiple high-quality sources including news and Wikipedia
- Case-sensitive processing
- Built upon multilingual BERT architecture
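A quick back-of-the-envelope calculation shows how the architecture figures above combine into the quoted parameter count. The vocabulary size used here is an assumption for illustration (the model's actual custom vocabulary size is not stated in this card); everything else follows the standard BERT-base configuration.

```python
# Rough parameter count for a BERT-base encoder (12 layers, hidden 768,
# FFN 3072, 512 positions). VOCAB = 120_000 is an assumed vocabulary size.
HIDDEN, LAYERS, FFN, MAX_POS, VOCAB = 768, 12, 3072, 512, 120_000

# Embeddings: token + position + segment tables, plus one LayerNorm.
embeddings = (VOCAB + MAX_POS + 2) * HIDDEN + 2 * HIDDEN

# Per layer: Q/K/V/output projections, two FFN matrices, two LayerNorms.
attention = 4 * (HIDDEN * HIDDEN + HIDDEN)
ffn = (HIDDEN * FFN + FFN) + (FFN * HIDDEN + HIDDEN)
layer = attention + ffn + 2 * (2 * HIDDEN)

total = embeddings + LAYERS * layer
print(f"{total / 1e6:.0f}M parameters")  # lands close to the quoted 180M
```

Most of the budget sits in the embedding table, which is why a large multilingual subtoken vocabulary pushes a BERT-base model toward 180M parameters rather than the ~110M of English BERT-base.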
## Core Capabilities
- Native support for Bulgarian, Czech, Polish, and Russian
- Optimized for Named Entity Recognition tasks
- Efficient cross-lingual transfer learning
- Contextual embeddings tailored to Slavic languages
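Since Named Entity Recognition is the headline use case, here is a minimal sketch of the post-processing step a NER pipeline on top of this model typically needs: turning token-level BIO tags back into entity spans. The tokens and tags below are invented for illustration.

```python
# Decode BIO-tagged tokens into (entity_type, text) spans.
def bio_to_spans(tokens, tags):
    """Collect entity spans from parallel lists of tokens and BIO tags."""
    spans, current, ctype = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity begins
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            current.append(token)         # continue the open entity
        else:                             # "O" or inconsistent I- closes it
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [], None
    if current:
        spans.append((ctype, " ".join(current)))
    return spans

tokens = ["Adam", "Mickiewicz", "urodził", "się", "w", "Zaosiu"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))
# [('PER', 'Adam Mickiewicz'), ('LOC', 'Zaosiu')]
```

The same decoding works unchanged across all four supported languages, since only the tag scheme matters at this stage.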
## Frequently Asked Questions
**Q: What makes this model unique?**
This model is specifically optimized for Slavic languages, offering better performance than general multilingual models for Bulgarian, Czech, Polish, and Russian language tasks. Its specialized training on news and Wikipedia content makes it particularly effective for real-world applications.
**Q: What are the recommended use cases?**
The model is particularly well-suited for Named Entity Recognition tasks in Slavic languages, as well as general natural language understanding tasks like text classification, sentiment analysis, and sequence labeling in these languages. It's especially valuable for applications requiring cross-lingual transfer between Slavic languages.