SlovakBERT
| Property | Value |
|---|---|
| Parameter Count | 125M |
| License | MIT |
| Architecture | RoBERTa |
| Paper | arXiv:2109.15254 |
| Training Data Size | 19.35 GB |
What is SlovakBERT?
SlovakBERT is a state-of-the-art language model designed specifically for Slovak, developed by Gerulata Technologies. It is a case-sensitive RoBERTa-based model trained on a diverse corpus of Slovak text, including Wikipedia, OpenSubtitles, OSCAR, and various web crawls, totaling 19.35 GB of text data.
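As a quick illustration of the model in use, here is a minimal fill-mask sketch with the Hugging Face transformers library; the gerulata/slovakbert model identifier is an assumption, so check the actual hub listing before running it:

```python
from transformers import pipeline

# Load SlovakBERT as a fill-mask pipeline (model id assumed: gerulata/slovakbert)
fill_mask = pipeline("fill-mask", model="gerulata/slovakbert")

# RoBERTa-style models use <mask> as the mask token
for prediction in fill_mask("Slovensko je krásna <mask>."):
    print(prediction["token_str"], prediction["score"])
```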
Implementation Details
The model was trained with fairseq on 4 Nvidia A100 GPUs for 300K steps, using a batch size of 512 and a sequence length of 512. Training used the Adam optimizer and 16-bit floating-point precision for efficiency.
- Trained on 181.6M unique sentences drawn from multiple datasets
- Pretrained with a masked language modeling (MLM) objective
- Supports both PyTorch and TensorFlow frameworks
- Replaces URLs and email addresses with special placeholder tokens during preprocessing (see the sketch after this list)
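A minimal sketch of that preprocessing step. The regex patterns and placeholder tokens below are our own illustrative choices; the exact tokens and rules used by the SlovakBERT authors may differ:

```python
import re

# Assumed placeholder tokens; the authors' actual special tokens may differ.
URL_TOKEN = "<url>"
EMAIL_TOKEN = "<email>"

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def mask_urls_and_emails(text: str) -> str:
    """Replace URLs and e-mail addresses with placeholder tokens."""
    text = URL_RE.sub(URL_TOKEN, text)
    return EMAIL_RE.sub(EMAIL_TOKEN, text)

print(mask_urls_and_emails("Kontakt: info@example.sk, web: https://example.sk"))
# -> Kontakt: <email>, web: <url>
```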
Core Capabilities
- Masked language modeling for Slovak text
- Text embedding generation (see the sketch below)
- Fine-tuning for downstream tasks
- Cross-framework compatibility (PyTorch/TensorFlow)
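A sketch of sentence-embedding generation via mean pooling over the last hidden states. The gerulata/slovakbert model id is assumed, and mean pooling is one common choice rather than an officially prescribed method:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "gerulata/slovakbert"  # assumed Hugging Face model id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pool the last hidden states into one vector per input text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)    # exclude padding from the mean
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

vectors = embed(["Dobrý deň", "Ako sa máš?"])
print(vectors.shape)  # torch.Size([2, 768]) for a RoBERTa-base-sized model
```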
Frequently Asked Questions
Q: What makes this model unique?
SlovakBERT is optimized specifically for the Slovak language, trained on an extensive and diverse dataset of Slovak text. It was built with Slovak-specific linguistic features in mind and preserves case, which improves accuracy on tasks where capitalization matters.
Q: What are the recommended use cases?
The model is primarily intended for fine-tuning on downstream Slovak NLP tasks. It excels at masked language modeling and can be used effectively for text embeddings, sentiment analysis, and other tasks that require an understanding of Slovak in context.
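A hedged fine-tuning sketch for one such downstream task, sentence-level sentiment classification. The model id, the two-example toy dataset, and the hyperparameters are all illustrative assumptions, not the authors' recipe:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "gerulata/slovakbert"  # assumed Hugging Face model id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

# Toy labelled data purely for illustration; a real run needs a proper corpus.
data = Dataset.from_dict({
    "text": ["Skvelý film!", "Strata času."],
    "label": [1, 0],
})
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=64),
    batched=True,
)

args = TrainingArguments(
    output_dir="slovakbert-sentiment",
    num_train_epochs=1,
    per_device_train_batch_size=2,
)

Trainer(model=model, args=args, train_dataset=data).train()
```

The same pattern applies to other downstream tasks: swap in AutoModelForTokenClassification for NER-style tagging, or adjust num_labels to match the label set of your dataset.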