bert-base-polish-cased-v1
| Property | Value |
|---|---|
| Parameter Count | 110M |
| Model Type | BERT Language Model |
| Architecture | 12-layer, 768-hidden, 12-heads |
| Author | Darek Kłeczek |
| Training Corpus Size | 68.3M lines, 646.5M words |
What is bert-base-polish-cased-v1?
bert-base-polish-cased-v1 is a Polish language model based on the BERT architecture. It is the cased counterpart of the uncased variant: it preserves letter case and Polish diacritics (ą, ć, ę, ł, ń, ó, ś, ź, ż) instead of normalizing them away. The model was pre-trained with Whole Word Masking on a curated, deduplicated corpus of Polish text drawn from Polish Wikipedia, the Polish Parliamentary Corpus, ParaCrawl, and deduplicated Open Subtitles.
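To make the Whole Word Masking idea concrete, here is a minimal sketch of the selection step. It assumes WordPiece-style tokens in which a `##` prefix marks a continuation piece. The function name, the 15% default, and the example token split are illustrative assumptions, not the author's training code, and the sketch leaves out the random-replacement details of real BERT pre-training.

```python
import random


def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Mask whole words in a WordPiece token list (simplified sketch).

    A token starting with '##' is treated as a continuation of the
    previous token, so both belong to the same word.
    """
    rng = random.Random(seed)

    # Group token indices into words: ['War', '##szawa'] -> [[0, 1]]
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    if not words:
        return []

    # Pick whole words (not individual pieces) to mask, at least one.
    n_to_mask = min(len(words), max(1, round(len(words) * mask_prob)))
    chosen = rng.sample(words, n_to_mask)

    masked = list(tokens)
    for word in chosen:
        for i in word:
            masked[i] = mask_token  # every piece of the word is masked
    return masked


# Hypothetical WordPiece split of "Warszawa jest stolicą Polski"
example = ["War", "##szawa", "jest", "stolic", "##ą", "Polski"]
print(whole_word_mask(example, mask_prob=0.5))
```

The property that matters is that `War` and `##szawa` are always masked or kept together, so the model cannot recover a word from its own unmasked pieces.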
Implementation Details
The model was trained in three phases: 100K steps at sequence length 128 with batch size 2048, another 100K steps with an adjusted learning rate, and a final 100K steps at sequence length 512 with batch size 256. Training ran on a Google Cloud TPU v3-8.
- Uses Whole Word Masking, so all sub-word pieces of a selected word are masked together
- Trained on 4.5B characters of Polish text
- Achieves 81.7% average score on the KLEJ benchmark
- Properly handles Polish-specific characters and accents
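The scale of the three-phase schedule is easier to judge with a little arithmetic. The sketch below multiplies out steps, batch size, and sequence length per phase. It assumes the second phase keeps the sequence length (128) and batch size (2048) of the first, which the description implies but does not state outright. The totals count token positions including padding, so they are an upper bound on real tokens seen.

```python
# (steps, sequence_length, batch_size) per phase.
# Phase 2 values are an assumption: same shape as phase 1, different learning rate.
PHASES = [
    (100_000, 128, 2048),
    (100_000, 128, 2048),
    (100_000, 512, 256),
]


def phase_totals(steps, seq_len, batch_size):
    """Return (sequences processed, token positions processed) for one phase."""
    sequences = steps * batch_size
    token_positions = sequences * seq_len
    return sequences, token_positions


total_positions = 0
for n, phase in enumerate(PHASES, start=1):
    sequences, positions = phase_totals(*phase)
    total_positions += positions
    print(f"phase {n}: {sequences:,} sequences, {positions:,} token positions")

print(f"total: {total_positions:,} token positions")
```

Under these assumptions, each of the first two phases processes 204.8M sequences and the final phase 25.6M. Because the longer sequences come with a smaller batch, the last phase covers about half as many token positions as either earlier phase.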
Core Capabilities
- Text classification and sequence labeling
- Named Entity Recognition (93.6% on NKJP-NER)
- Sentiment Analysis (87.4% on PolEmo2.0-IN)
- Question Answering (after task-specific fine-tuning)
- Masked Language Modeling, including filling in masked words in a sentence
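Masked language modeling is the one capability listed above that works without fine-tuning. The following sketch uses the Hugging Face `transformers` fill-mask pipeline and assumes the checkpoint is published on the Hugging Face Hub as `dkleczek/bert-base-polish-cased-v1`. The helper names, the `{mask}` placeholder convention, and the example sentence are illustrative choices rather than part of the model.

```python
MODEL_ID = "dkleczek/bert-base-polish-cased-v1"  # assumed Hub identifier


def build_prompt(template, mask_token):
    """Insert the tokenizer's mask token at the '{mask}' placeholder."""
    return template.format(mask=mask_token)


def top_predictions(template, top_k=5):
    """Return (token, score) pairs for the masked position in `template`."""
    # Imported here so the helpers above work without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    prompt = build_prompt(template, fill_mask.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill_mask(prompt, top_k=top_k)]


# Example call (downloads the model weights on first use):
#   for token, score in top_predictions("Adam Mickiewicz wielkim polskim {mask} był."):
#       print(f"{token}\t{score:.3f}")
```

The other capabilities in the list, such as classification, NER, and question answering, need a task-specific head trained on labeled data on top of this encoder.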
Frequently Asked Questions
Q: What makes this model unique?
This model is built specifically for Polish: it preserves letter case and Polish diacritics, and it was trained on a balanced, deduplicated dataset. That makes it a better fit than the uncased variant for tasks where case or diacritics carry meaning.
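To show why preserving diacritics matters, the sketch below reproduces the kind of normalization that the original uncased BERT preprocessing applies: lowercasing, Unicode NFD decomposition, then removal of combining marks. The function name is hypothetical, and the sketch illustrates the general technique rather than the exact tokenizer code of either Polish model.

```python
import unicodedata


def uncased_normalize(text):
    """Lowercase and drop combining marks, as uncased BERT-style preprocessing does."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Category 'Mn' (nonspacing mark) covers the accents split off by NFD.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Distinct Polish words that collapse to the same string after normalization.
for accented, plain in [("sąd", "sad"), ("kąt", "kat")]:
    print(accented, "->", uncased_normalize(accented), "| same as", plain, ":",
          uncased_normalize(accented) == plain)

print(uncased_normalize("Zażółć gęślą jaźń"))
```

After this normalization "sąd" (court) and "sad" (orchard) become indistinguishable. The letter "ł" has no Unicode decomposition and survives, so the handling of Polish letters is also inconsistent. A cased model skips this step and keeps the distinctions.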
Q: What are the recommended use cases?
The model is recommended for named entity recognition, sentiment analysis, and text classification, typically after fine-tuning on task data. It is particularly well suited to applications where letter case and diacritics matter for understanding Polish text.