SlovakBERT
| Property | Value |
|---|---|
| Parameter Count | 125M |
| License | MIT |
| Architecture | RoBERTa |
| Paper | arXiv:2109.15254 |
| Training Data Size | 19.35 GB |
What is SlovakBERT?
SlovakBERT is a state-of-the-art language model designed specifically for Slovak, developed by Gerulata Technologies. It is a case-sensitive RoBERTa-based model trained on a diverse corpus of Slovak text, including Wikipedia, OpenSubtitles, OSCAR, and various web crawls, totaling 19.35 GB of text data.
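As a quick illustration of the model in use, here is a minimal fill-mask sketch with the Hugging Face transformers library; the gerulata/slovakbert model identifier is an assumption, so check the actual hub listing before running it:

```python
from transformers import pipeline

# Load SlovakBERT as a fill-mask pipeline (model id assumed: gerulata/slovakbert)
fill_mask = pipeline("fill-mask", model="gerulata/slovakbert")

# RoBERTa-style models use <mask> as the mask token
for prediction in fill_mask("Slovensko je krásna <mask>."):
    print(prediction["token_str"], prediction["score"])
```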
Implementation Details
The model was trained with fairseq on 4 Nvidia A100 GPUs for 300K steps, using a batch size of 512 and a sequence length of 512. Training used the Adam optimizer and 16-bit floating-point precision for efficiency.
- Trained on 181.6M unique sentences drawn from multiple datasets
- Pretrained with a masked language modeling (MLM) objective
- Supports both PyTorch and TensorFlow frameworks
- Replaces URLs and email addresses with special placeholder tokens during preprocessing (see the sketch after this list)
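A minimal sketch of that preprocessing step. The regex patterns and placeholder tokens below are our own illustrative choices; the exact tokens and rules used by the SlovakBERT authors may differ:

```python
import re

# Assumed placeholder tokens; the authors' actual special tokens may differ.
URL_TOKEN = "<url>"
EMAIL_TOKEN = "<email>"

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def mask_urls_and_emails(text: str) -> str:
    """Replace URLs and e-mail addresses with placeholder tokens."""
    text = URL_RE.sub(URL_TOKEN, text)
    return EMAIL_RE.sub(EMAIL_TOKEN, text)

print(mask_urls_and_emails("Kontakt: info@example.sk, web: https://example.sk"))
# -> Kontakt: <email>, web: <url>
```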
Core Capabilities
- Masked language modeling for Slovak text
- Text embedding generation (see the sketch below)
- Fine-tuning for downstream tasks
- Cross-framework compatibility (PyTorch/TensorFlow)
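A sketch of sentence-embedding generation via mean pooling over the last hidden states. The gerulata/slovakbert model id is assumed, and mean pooling is one common choice rather than an officially prescribed method:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "gerulata/slovakbert"  # assumed Hugging Face model id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pool the last hidden states into one vector per input text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)    # exclude padding from the mean
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

vectors = embed(["Dobrý deň", "Ako sa máš?"])
print(vectors.shape)  # torch.Size([2, 768]) for a RoBERTa-base-sized model
```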
Frequently Asked Questions
Q: What makes this model unique?
SlovakBERT is optimized specifically for the Slovak language, trained on an extensive and diverse dataset of Slovak text. It was built with Slovak-specific linguistic features in mind and preserves case, which improves accuracy on tasks where capitalization matters.
Q: What are the recommended use cases?
The model is primarily intended for fine-tuning on downstream Slovak NLP tasks. It excels at masked language modeling and can be used effectively for text embeddings, sentiment analysis, and other tasks that require an understanding of Slovak in context.
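A hedged fine-tuning sketch for one such downstream task, sentence-level sentiment classification. The model id, the two-example toy dataset, and the hyperparameters are all illustrative assumptions, not the authors' recipe:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "gerulata/slovakbert"  # assumed Hugging Face model id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

# Toy labelled data purely for illustration; a real run needs a proper corpus.
data = Dataset.from_dict({
    "text": ["Skvelý film!", "Strata času."],
    "label": [1, 0],
})
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=64),
    batched=True,
)

args = TrainingArguments(
    output_dir="slovakbert-sentiment",
    num_train_epochs=1,
    per_device_train_batch_size=2,
)

Trainer(model=model, args=args, train_dataset=data).train()
```

The same pattern applies to other downstream tasks: swap in AutoModelForTokenClassification for NER-style tagging, or adjust num_labels to match the label set of your dataset.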