herbert-base-cased
| Property | Value |
|---|---|
| License | CC BY 4.0 |
| Author | Allegro |
| Paper | HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish |
| Downloads | 56,701 |
What is herbert-base-cased?
HerBERT is a BERT-based language model designed and trained specifically for Polish. Developed by Allegro's Machine Learning Research team in collaboration with the Linguistic Engineering Group at the Polish Academy of Sciences, it was pretrained with both Masked Language Modelling (MLM) and a Sentence Structural Objective (SSO), using dynamic masking of whole words.
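As a concrete starting point, the checkpoint can be loaded through the Hugging Face transformers API. The snippet below is a minimal sketch, assuming the model is published under the repository id allegro/herbert-base-cased and that a recent transformers version is installed:

```python
from transformers import AutoTokenizer, AutoModel

# Assumed Hugging Face repository id for this model card.
MODEL_ID = "allegro/herbert-base-cased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

# Encode a Polish sentence and obtain contextual token embeddings.
inputs = tokenizer("Ala ma kota i psa.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```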
Implementation Details
The model was trained on a Polish corpus of over 8.5 billion tokens drawn from CCNet, the National Corpus of Polish, Open Subtitles, Wikipedia, and Wolne Lektury. It uses a character-level byte-pair encoding (BPE) tokenizer with a 50k-token vocabulary.
- Uses a character-level BPE tokenizer (CharBPETokenizer), with HerbertTokenizerFast support in transformers (see the tokenizer sketch after this list)
- Trained with the transformers framework, version 2.9
- Pretrained with both the MLM and SSO objectives
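To make the tokenization behaviour concrete, the sketch below loads the fast tokenizer and prints the sub-word pieces for a short Polish sentence. The repository id allegro/herbert-base-cased is an assumption carried over from the loading example above:

```python
from transformers import AutoTokenizer

# Assumed repository id; AutoTokenizer should resolve to HerbertTokenizerFast here.
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")

sentence = "Zażółć gęślą jaźń."
tokens = tokenizer.tokenize(sentence)
print(tokens)                # character-level BPE pieces; rare words split into sub-words
print(tokenizer.vocab_size)  # roughly 50k entries
```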
Core Capabilities
- Advanced Polish language understanding and representation
- Efficient text tokenization with character-level BPE
- Support for cased text processing
- Compatible with the Hugging Face transformers library and its standard BERT tooling
- Optimized for Polish NLP tasks
Frequently Asked Questions
Q: What makes this model unique?
HerBERT stands out for its specialized training on Polish language corpora and its implementation of both MLM and SSO objectives, making it particularly effective for Polish language tasks. The use of dynamic whole-word masking and character-level BPE tokenization further enhances its performance.
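For intuition about what dynamic whole-word masking means at the sub-word level, the sketch below groups the pieces of each word via the fast tokenizer's word_ids() mapping and masks every piece of one randomly chosen word. This is only an illustration of the idea, not the original pretraining code, and the repository id is again assumed to be allegro/herbert-base-cased:

```python
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")  # assumed repo id
enc = tokenizer("Zamek w Malborku jest największym zamkiem gotyckim w Europie.")

# word_ids() maps every sub-word piece back to its source word (None = special token).
word_ids = enc.word_ids()
candidate_words = sorted({w for w in word_ids if w is not None})
target = random.choice(candidate_words)

# Whole-word masking: replace *all* pieces of the chosen word with the mask token.
masked_ids = [
    tokenizer.mask_token_id if w == target else tok_id
    for tok_id, w in zip(enc["input_ids"], word_ids)
]
print(tokenizer.convert_ids_to_tokens(masked_ids))
```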
Q: What are the recommended use cases?
The model is well suited to Polish language processing tasks such as text classification, named entity recognition, question answering, and general Polish language understanding. It is particularly useful for applications that require a deep understanding of Polish text structure and semantics.
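As one concrete use case, the checkpoint can serve as the backbone of a Polish text classifier. The sketch below only sets up the classification head; the label count and example sentences are placeholders, the repository id is again an assumption, and the head must still be fine-tuned on labelled data:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "allegro/herbert-base-cased"  # assumed repository id
NUM_LABELS = 3                           # placeholder: e.g. negative / neutral / positive

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# A fresh classification head is added on top of the pretrained encoder;
# it needs fine-tuning on labelled Polish data before the outputs are meaningful.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=NUM_LABELS)

batch = tokenizer(["Świetny produkt!", "Nie polecam."], padding=True, return_tensors="pt")
logits = model(**batch).logits
print(logits.shape)  # (2, NUM_LABELS); values are from the untrained head
```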