herbert-base-cased
| Property | Value |
|---|---|
| License | CC BY 4.0 |
| Author | Allegro |
| Paper | HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish |
| Downloads | 56,701 |
What is herbert-base-cased?
HerBERT is a BERT-based language model designed and trained specifically for Polish. Developed by Allegro's Machine Learning Research team in collaboration with the Linguistic Engineering Group at the Polish Academy of Sciences, it was pretrained with both Masked Language Modelling (MLM) and a Sentence Structural Objective (SSO), using dynamic masking of whole words.
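As a concrete starting point, the checkpoint can be loaded through the Hugging Face transformers API. The snippet below is a minimal sketch, assuming the model is published under the repository id allegro/herbert-base-cased and that a recent transformers version is installed:

```python
from transformers import AutoTokenizer, AutoModel

# Assumed Hugging Face repository id for this model card.
MODEL_ID = "allegro/herbert-base-cased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

# Encode a Polish sentence and obtain contextual token embeddings.
inputs = tokenizer("Ala ma kota i psa.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```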
Implementation Details
The model was trained on a Polish corpus of over 8.5 billion tokens drawn from CCNet, the National Corpus of Polish, Open Subtitles, Wikipedia, and Wolne Lektury. It uses a character-level byte-pair encoding (BPE) tokenizer with a 50k-token vocabulary.
- Uses a character-level BPE tokenizer (CharBPETokenizer), with HerbertTokenizerFast support in transformers (see the tokenizer sketch after this list)
- Trained with the transformers framework, version 2.9
- Pretrained with both the MLM and SSO objectives
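To make the tokenization behaviour concrete, the sketch below loads the fast tokenizer and prints the sub-word pieces for a short Polish sentence. The repository id allegro/herbert-base-cased is an assumption carried over from the loading example above:

```python
from transformers import AutoTokenizer

# Assumed repository id; AutoTokenizer should resolve to HerbertTokenizerFast here.
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")

sentence = "Zażółć gęślą jaźń."
tokens = tokenizer.tokenize(sentence)
print(tokens)                # character-level BPE pieces; rare words split into sub-words
print(tokenizer.vocab_size)  # roughly 50k entries
```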
Core Capabilities
- Advanced Polish language understanding and representation
- Efficient text tokenization with character-level BPE
- Support for cased text processing
- Compatible with the Hugging Face transformers library and its standard BERT tooling
- Optimized for Polish NLP tasks
Frequently Asked Questions
Q: What makes this model unique?
HerBERT stands out for its specialized training on Polish language corpora and its implementation of both MLM and SSO objectives, making it particularly effective for Polish language tasks. The use of dynamic whole-word masking and character-level BPE tokenization further enhances its performance.
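For intuition about what dynamic whole-word masking means at the sub-word level, the sketch below groups the pieces of each word via the fast tokenizer's word_ids() mapping and masks every piece of one randomly chosen word. This is only an illustration of the idea, not the original pretraining code, and the repository id is again assumed to be allegro/herbert-base-cased:

```python
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")  # assumed repo id
enc = tokenizer("Zamek w Malborku jest największym zamkiem gotyckim w Europie.")

# word_ids() maps every sub-word piece back to its source word (None = special token).
word_ids = enc.word_ids()
candidate_words = sorted({w for w in word_ids if w is not None})
target = random.choice(candidate_words)

# Whole-word masking: replace *all* pieces of the chosen word with the mask token.
masked_ids = [
    tokenizer.mask_token_id if w == target else tok_id
    for tok_id, w in zip(enc["input_ids"], word_ids)
]
print(tokenizer.convert_ids_to_tokens(masked_ids))
```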
Q: What are the recommended use cases?
The model is well suited to Polish language processing tasks such as text classification, named entity recognition, question answering, and general Polish language understanding. It is particularly useful for applications that require a deep understanding of Polish text structure and semantics.
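As one concrete use case, the checkpoint can serve as the backbone of a Polish text classifier. The sketch below only sets up the classification head; the label count and example sentences are placeholders, the repository id is again an assumption, and the head must still be fine-tuned on labelled data:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "allegro/herbert-base-cased"  # assumed repository id
NUM_LABELS = 3                           # placeholder: e.g. negative / neutral / positive

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# A fresh classification head is added on top of the pretrained encoder;
# it needs fine-tuning on labelled Polish data before the outputs are meaningful.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=NUM_LABELS)

batch = tokenizer(["Świetny produkt!", "Nie polecam."], padding=True, return_tensors="pt")
logits = model(**batch).logits
print(logits.shape)  # (2, NUM_LABELS); values are from the untrained head
```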