bert-base-polish-cased-v1

Property	Value
Parameter Count	110M
Model Type	BERT Language Model
Architecture	12-layer, 768-hidden, 12-heads
Author	Darek Kłeczek
Training Corpus Size	68.3M lines, 646.5M words

What is bert-base-polish-cased-v1?

bert-base-polish-cased-v1 is a Polish language model based on the BERT architecture that averages 81.7% on the KLEJ benchmark. It is an improved, cased version that handles Polish characters and accents correctly, and it was trained on a curated, deduplicated corpus of Polish text. The model uses Whole Word Masking and draws on diverse sources: Polish Wikipedia, the Polish Parliamentary Corpus, ParaCrawl, and deduplicated Open Subtitles.
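As a quick way to see the pretrained masked-language-model head at work, here is a minimal sketch using the Hugging Face `transformers` fill-mask pipeline. The Hub identifier `dkleczek/bert-base-polish-cased-v1` and the helper name `top_predictions` are assumptions for illustration and are not stated in this card.

```python
# Assumed Hugging Face Hub id for this model; not stated in the card itself.
MODEL_ID = "dkleczek/bert-base-polish-cased-v1"


def top_predictions(sentence: str, k: int = 5):
    """Return (token, score) pairs for the single [MASK] token in `sentence`.

    Hypothetical helper. The import is done lazily so that defining the
    function does not require `transformers` to be installed.
    """
    from transformers import pipeline

    # Downloads the tokenizer and weights on first use.
    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    predictions = fill_mask(sentence, top_k=k)
    return [(p["token_str"], p["score"]) for p in predictions]
```

Calling `top_predictions("Adam Mickiewicz wielkim polskim [MASK] był.")` should return the five highest-scoring fillers for the masked word. The sentence must contain exactly one `[MASK]` token for the result to have this flat shape.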

Implementation Details

The model was trained in three phases with varying sequence lengths and batch sizes: 100K steps at sequence length 128 with batch size 2048, followed by another 100K steps with an adjusted learning rate, and finally 100K steps at sequence length 512 with batch size 256. Training was performed on a Google Cloud TPU v3-8.
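The arithmetic behind this schedule can be sketched as follows. The `Phase` class and `SCHEDULE` list are hypothetical names, and the second phase is assumed to keep the sequence length and batch size of the first, since the text only mentions a learning-rate change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    steps: int
    seq_len: int
    batch_size: int

    @property
    def sequences(self) -> int:
        """Number of training sequences consumed in this phase."""
        return self.steps * self.batch_size

    @property
    def max_tokens(self) -> int:
        """Upper bound on token positions processed (padding included)."""
        return self.sequences * self.seq_len


SCHEDULE = [
    Phase(steps=100_000, seq_len=128, batch_size=2048),
    # Assumption: same shape as phase 1, only the learning rate differs.
    Phase(steps=100_000, seq_len=128, batch_size=2048),
    Phase(steps=100_000, seq_len=512, batch_size=256),
]

total_steps = sum(p.steps for p in SCHEDULE)
total_tokens = sum(p.max_tokens for p in SCHEDULE)

for i, phase in enumerate(SCHEDULE, start=1):
    print(f"phase {i}: {phase.sequences:,} sequences, {phase.max_tokens:,} token positions")
print(f"total: {total_steps:,} steps, {total_tokens:,} token positions")
```

Under these assumptions, each step of the final phase covers half as many token positions as a step of the first phase (256 × 512 versus 2048 × 128), which is the usual trade-off when moving to longer sequences on fixed hardware.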

Implements Whole Word Masking for better contextual understanding
Trained on 4.5B characters of Polish text
Achieves 81.7% average score on the KLEJ benchmark
Properly handles Polish-specific characters and accents
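To illustrate what Whole Word Masking means in practice, here is a simplified sketch, not the training code used for this model. It assumes WordPiece tokens in which continuation pieces start with `##`, always substitutes `[MASK]` (the original BERT recipe also keeps or randomly replaces some selected tokens), and uses a made-up segmentation of a Polish sentence.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def group_whole_words(tokens):
    """Group token indices so '##' continuation pieces stay with their word."""
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def whole_word_mask(tokens, mask_prob=0.15, rng=None, mask_token="[MASK]"):
    """Mask whole words until roughly `mask_prob` of the tokens are covered.

    Simplification: the budget may be overshot by the tail of the last word.
    """
    rng = rng or random.Random()
    candidates = [g for g in group_whole_words(tokens)
                  if tokens[g[0]] not in SPECIAL_TOKENS]
    rng.shuffle(candidates)

    budget = max(1, round(len(tokens) * mask_prob))
    masked = list(tokens)
    covered = 0
    for group in candidates:
        if covered >= budget:
            break
        for i in group:  # every piece of the chosen word is masked together
            masked[i] = mask_token
        covered += len(group)
    return masked


# Hypothetical WordPiece segmentation, for illustration only.
example = ["[CLS]", "Warszaw", "##a", "jest", "stolic", "##ą", "Polski", "[SEP]"]
print(whole_word_mask(example, rng=random.Random(0)))
```

The point of the grouping step is that a word such as `stolic` + `##ą` is never half-masked, so the model cannot recover the hidden piece trivially from its visible neighbour.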

Core Capabilities

Text classification and sequence labeling
Named Entity Recognition (93.6% on NKJP-NER)
Sentiment Analysis (87.4% on PolEmo2.0-IN)
Question Answering and Text Completion (filling in masked tokens, not free-form generation)
Masked Language Modeling
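For the downstream tasks listed above, a task-specific head is normally attached to the pretrained encoder and fine-tuned. The following sketch shows one way to do that with the `transformers` Auto classes. The Hub id, the `TASK_HEADS` mapping, and the `load_for_task` helper are assumptions for illustration.

```python
# Assumed Hugging Face Hub id for this model; not stated in the card itself.
MODEL_ID = "dkleczek/bert-base-polish-cased-v1"

# Task name -> transformers Auto class that adds the matching head.
TASK_HEADS = {
    "text-classification": "AutoModelForSequenceClassification",
    "token-classification": "AutoModelForTokenClassification",  # e.g. NER
    "question-answering": "AutoModelForQuestionAnswering",
    "fill-mask": "AutoModelForMaskedLM",
}


def load_for_task(task: str, **head_kwargs):
    """Load the tokenizer and the encoder with a head for `task`.

    Hypothetical helper. Extra keyword arguments such as `num_labels`
    are forwarded to `from_pretrained`.
    """
    import transformers  # imported lazily; required only when loading

    model_cls = getattr(transformers, TASK_HEADS[task])
    tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_ID)
    model = model_cls.from_pretrained(MODEL_ID, **head_kwargs)
    return tokenizer, model
```

For example, `load_for_task("text-classification", num_labels=4)` would return a model whose classification layer is randomly initialized. Only the fill-mask head comes pretrained, so the classification, tagging, and question-answering heads need fine-tuning on labeled data before they produce useful predictions.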

Frequently Asked Questions

Q: What makes this model unique?

This model is built specifically for Polish: it handles Polish characters and accents properly and was trained on a carefully balanced, deduplicated dataset. It is intended to improve on the uncased variant in tasks that require precise character handling.
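Why character handling matters for Polish can be shown with a small standard-library sketch. It mimics the lowercasing and accent stripping that uncased BERT preprocessing typically applies (NFD decomposition followed by removal of combining marks). The function names are hypothetical, and this is an illustration rather than the tokenizer code of either model.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks (category Mn) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def uncased_normalize(text: str) -> str:
    """Lowercase, then strip accents, as uncased preprocessing usually does."""
    return strip_accents(text.lower())


print(uncased_normalize("sąd"))   # sad
print(uncased_normalize("Żółć"))  # zołc
```

After this normalization, "sąd" (court) becomes indistinguishable from "sad" (orchard). The letter "ł" survives because it has no Unicode decomposition, so the damage is uneven across the alphabet. A cased model that skips this step sees the words as written.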

Q: What are the recommended use cases?

The model is suited to tasks such as named entity recognition (93.6% on NKJP-NER), sentiment analysis (87.4% on PolEmo2.0-IN), and text classification. It is particularly well suited to applications that need to understand Polish text with case sensitivity and accents preserved.