papuGaPT2
| Property | Value |
|---|---|
| Language | Polish |
| Training Data | OSCAR corpus (Polish subset) |
| Evaluation Perplexity | 21.79 |
| Author | flax-community |
What is papuGaPT2?
papuGaPT2 is a Polish-language GPT2 model designed to bring modern text generation capabilities to the Polish NLP community. Built on the standard GPT2 architecture, it was trained with a causal language modeling objective on the Polish subset of the multilingual OSCAR corpus, reaching an evaluation perplexity of 21.79.
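The quickest way to try the model is the Hugging Face transformers text-generation pipeline. The sketch below assumes the Hub id flax-community/papuGaPT2 with PyTorch weights available; the prompt is an arbitrary Polish example, not taken from the model card.

```python
# Minimal usage sketch: generate Polish text with the transformers pipeline.
# The Hub id "flax-community/papuGaPT2" and the prompt are assumptions here.
from transformers import pipeline, set_seed

set_seed(42)  # make the sampled output reproducible
generator = pipeline("text-generation", model="flax-community/papuGaPT2")

results = generator(
    "Najsmaczniejszy polski owoc to",  # "The tastiest Polish fruit is"
    max_length=30,
    num_return_sequences=2,
)
for result in results:
    print(result["generated_text"])
```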
Implementation Details
The model uses a byte-level version of Byte Pair Encoding (BPE) for tokenization, with a vocabulary size of 50,257. Training was conducted on a TPUv3 VM in three phases with different learning rates and batch sizes, and finished with an evaluation loss of 3.082, consistent with the reported perplexity of 21.79 (exp(3.082) ≈ 21.8). A short tokenization sketch follows the list below.
- Tokenization: Byte-level BPE with 50,257 vocab size
- Input sequences: 512 consecutive tokens
- Training infrastructure: TPUv3 VM
- Training phases: 3 distinct phases with different learning rates
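A minimal sketch of the tokenizer side, assuming the Hub id flax-community/papuGaPT2 and a standard GPT2-style byte-level BPE tokenizer; the example sentence is arbitrary.

```python
# Minimal tokenization sketch matching the details above (assumed Hub id).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("flax-community/papuGaPT2")
print(tokenizer.vocab_size)  # expected: 50257 for the byte-level BPE vocabulary

# Training used sequences of 512 consecutive tokens; the same cap is applied here.
encoded = tokenizer(
    "To jest przykładowe polskie zdanie.",  # "This is an example Polish sentence."
    truncation=True,
    max_length=512,
)
print(encoded["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```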
Core Capabilities
- Text generation with multiple decoding methods: greedy, beam search, and sampling (see the sketch after this list)
- Support for top-k and top-p sampling
- Zero-shot and few-shot learning capabilities
- Bad words filtering functionality
- Context-aware text completion
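A sketch of the generation options listed above: sampling with top-k and top-p, plus a bad-words filter. The Hub id, prompt, and blocked words are illustrative assumptions, not taken from the model card.

```python
# Sampling with top-k/top-p and a bad-words filter (assumed Hub id and prompt).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "flax-community/papuGaPT2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Najpiękniejsze polskie miasto to", return_tensors="pt")

# Words we do not want generated. The leading space matches how byte-level
# BPE encodes words that appear mid-sentence.
bad_words_ids = tokenizer(
    [" Warszawa", " Kraków"], add_special_tokens=False
).input_ids

output_ids = model.generate(
    **inputs,
    do_sample=True,              # sampling rather than greedy/beam search
    top_k=50,                    # keep only the 50 most likely next tokens
    top_p=0.95,                  # nucleus sampling over 95% probability mass
    max_length=40,
    bad_words_ids=bad_words_ids,
    pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad warning
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```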
Frequently Asked Questions
Q: What makes this model unique?
This is one of the first strong text generation models specifically trained for the Polish language, filling a crucial gap in Polish NLP research. Its performance and versatility make it particularly valuable for Polish language processing tasks.
Q: What are the recommended use cases?
The model is primarily recommended for research purposes due to potential biases in the training data. It can be used for text generation, feature extraction, or fine-tuning for downstream tasks. However, users should be aware of and account for potential biases, particularly regarding gender and ethnicity.
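For feature extraction, the model's final hidden states can serve as contextual token embeddings. A minimal sketch, assuming the Hub id flax-community/papuGaPT2 and an arbitrary example sentence:

```python
# Feature-extraction sketch: contextual embeddings from the last hidden layer.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "flax-community/papuGaPT2"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("Przykładowe zdanie po polsku.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per input token: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```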