# RobBERT v2 Dutch Base Model
| Property | Value |
|---|---|
| Parameter Count | 117M |
| License | MIT |
| Architecture | RoBERTa |
| Training Data | 39GB Dutch OSCAR corpus |
## What is robbert-v2-dutch-base?
RobBERT is a state-of-the-art Dutch language model based on the RoBERTa architecture. Developed by researchers at KU Leuven, it was trained on 6.6 billion words from the Dutch section of the OSCAR corpus. The model represents a significant advance in Dutch natural language processing, achieving superior performance across multiple downstream tasks.
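As a rough sketch of getting started, the model can be loaded with the Hugging Face `transformers` library. The repository id `pdelobelle/robbert-v2-dutch-base` below is an assumption; substitute the checkpoint name you actually use.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed repository id; replace with the checkpoint you are using.
MODEL_ID = "pdelobelle/robbert-v2-dutch-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
```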
## Implementation Details
The model implements the RoBERTa base architecture with 12 self-attention layers of 12 heads each, for a total of 117M trainable parameters. It was trained with the Adam optimizer (beta_1=0.9, beta_2=0.98) and a polynomial learning-rate decay schedule, with training distributed across up to 80 GPUs on multiple computing nodes.
- Pre-trained on the masked language modeling task
- Trained for two epochs on 126 million lines of text
- Uses a weight decay of 0.1 and a dropout of 0.1 (an illustrative optimizer setup is sketched below)
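For illustration only, the reported hyperparameters could be reproduced in PyTorch roughly as follows; the peak learning rate, warmup steps, and total step count are assumptions, and the original training setup may differ in detail.

```python
import torch
from transformers import AutoModelForMaskedLM, get_polynomial_decay_schedule_with_warmup

model = AutoModelForMaskedLM.from_pretrained("pdelobelle/robbert-v2-dutch-base")  # assumed repo id

# Adam with the reported betas and a weight decay of 0.1.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=6e-4,               # illustrative peak learning rate, not stated in the source
    betas=(0.9, 0.98),
    weight_decay=0.1,
)

# Polynomially decaying learning rate, as described above.
scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_000,       # illustrative
    num_training_steps=100_000,   # illustrative
)
```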
## Core Capabilities
- Sentiment Analysis (95.1% accuracy on Dutch Book Reviews)
- Named Entity Recognition (89.08% accuracy)
- Coreference Resolution (99.23% accuracy)
- Part-of-Speech Tagging (96.4% accuracy)
- Zero-shot word prediction (see the fill-mask example after this list)
- Emotion detection
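Zero-shot word prediction maps directly onto the model's masked-language-modeling head. A minimal sketch with the `transformers` fill-mask pipeline, assuming the same repository id as above:

```python
from transformers import pipeline

# RoBERTa-style models use "<mask>" as the mask token.
fill_mask = pipeline("fill-mask", model="pdelobelle/robbert-v2-dutch-base")  # assumed repo id

# "There is a <mask> in my garden."
for prediction in fill_mask("Er staat een <mask> in mijn tuin."):
    print(prediction["token_str"], round(prediction["score"], 3))
```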
## Frequently Asked Questions
**Q: What makes this model unique?**
RobBERT v2 stands out for its exceptional performance on Dutch language tasks, particularly with small datasets. It's the first Dutch language model to achieve over 95% accuracy on sentiment analysis and demonstrates superior performance in gender-related linguistic tasks.
**Q: What are the recommended use cases?**
The model excels in various Dutch NLP tasks including sentiment analysis, coreference resolution, named entity recognition, and part-of-speech tagging. It's particularly recommended for scenarios with limited training data, where it significantly outperforms other models.
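As a hedged sketch of the fine-tuning workflow for a task such as sentiment analysis, the snippet below attaches a classification head to the pretrained encoder and trains it with the `transformers` `Trainer`. The repository id and the CSV files with `text`/`label` columns are placeholders for whatever Dutch dataset you use.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "pdelobelle/robbert-v2-dutch-base"  # assumed repository id

# Fresh binary classification head on top of the pretrained encoder.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

# Placeholder CSV files with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="robbert-sentiment", num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```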