# RobBERT v2 Dutch Base Model
| Property | Value |
|---|---|
| Parameter Count | 117M |
| License | MIT |
| Architecture | RoBERTa |
| Training Data | 39GB Dutch OSCAR corpus |
## What is robbert-v2-dutch-base?
RobBERT is a state-of-the-art Dutch language model based on the RoBERTa architecture. Developed by researchers at KU Leuven, it was trained on 6.6 billion words from the Dutch section of the OSCAR corpus. The model represents a significant advance in Dutch natural language processing, achieving superior performance across multiple downstream tasks.
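As a rough sketch of getting started, the model can be loaded with the Hugging Face `transformers` library. The repository id `pdelobelle/robbert-v2-dutch-base` below is an assumption; substitute the checkpoint name you actually use.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed repository id; replace with the checkpoint you are using.
MODEL_ID = "pdelobelle/robbert-v2-dutch-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
```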
## Implementation Details
The model implements the RoBERTa base architecture with 12 self-attention layers of 12 heads each, for a total of 117M trainable parameters. It was trained with the Adam optimizer (beta_1=0.9, beta_2=0.98) and a polynomial learning-rate decay schedule, with training distributed across up to 80 GPUs on multiple computing nodes.
- Pre-trained on the masked language modeling task
- Trained for two epochs on 126 million lines of text
- Uses a weight decay of 0.1 and a dropout of 0.1 (an illustrative optimizer setup is sketched below)
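For illustration only, the reported hyperparameters could be reproduced in PyTorch roughly as follows; the peak learning rate, warmup steps, and total step count are assumptions, and the original training setup may differ in detail.

```python
import torch
from transformers import AutoModelForMaskedLM, get_polynomial_decay_schedule_with_warmup

model = AutoModelForMaskedLM.from_pretrained("pdelobelle/robbert-v2-dutch-base")  # assumed repo id

# Adam with the reported betas and a weight decay of 0.1.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=6e-4,               # illustrative peak learning rate, not stated in the source
    betas=(0.9, 0.98),
    weight_decay=0.1,
)

# Polynomially decaying learning rate, as described above.
scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_000,       # illustrative
    num_training_steps=100_000,   # illustrative
)
```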
## Core Capabilities
- Sentiment Analysis (95.1% accuracy on Dutch Book Reviews)
- Named Entity Recognition (89.08% accuracy)
- Coreference Resolution (99.23% accuracy)
- Part-of-Speech Tagging (96.4% accuracy)
- Zero-shot word prediction (see the fill-mask example after this list)
- Emotion detection
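Zero-shot word prediction maps directly onto the model's masked-language-modeling head. A minimal sketch with the `transformers` fill-mask pipeline, assuming the same repository id as above:

```python
from transformers import pipeline

# RoBERTa-style models use "<mask>" as the mask token.
fill_mask = pipeline("fill-mask", model="pdelobelle/robbert-v2-dutch-base")  # assumed repo id

# "There is a <mask> in my garden."
for prediction in fill_mask("Er staat een <mask> in mijn tuin."):
    print(prediction["token_str"], round(prediction["score"], 3))
```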
## Frequently Asked Questions
**Q: What makes this model unique?**
RobBERT v2 stands out for its exceptional performance on Dutch language tasks, particularly with small datasets. It's the first Dutch language model to achieve over 95% accuracy on sentiment analysis and demonstrates superior performance in gender-related linguistic tasks.
**Q: What are the recommended use cases?**
The model excels in various Dutch NLP tasks including sentiment analysis, coreference resolution, named entity recognition, and part-of-speech tagging. It's particularly recommended for scenarios with limited training data, where it significantly outperforms other models.
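As a hedged sketch of the fine-tuning workflow for a task such as sentiment analysis, the snippet below attaches a classification head to the pretrained encoder and trains it with the `transformers` `Trainer`. The repository id and the CSV files with `text`/`label` columns are placeholders for whatever Dutch dataset you use.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "pdelobelle/robbert-v2-dutch-base"  # assumed repository id

# Fresh binary classification head on top of the pretrained encoder.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

# Placeholder CSV files with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="robbert-sentiment", num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```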