BERTje: A Dutch BERT Model
| Property | Value |
|---|---|
| Developer | GroNLP (University of Groningen) |
| Architecture | BERT-base (12 layers, cased) |
| Paper | arXiv:1912.09582 |
| Language | Dutch |
What is bert-base-dutch-cased?
BERTje is a BERT model developed specifically for Dutch by researchers at the University of Groningen. As a cased model, it preserves capitalization information, which makes it particularly effective for tasks such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging, where case is an informative signal in Dutch text.
Implementation Details
The model follows the BERT-base architecture with 12 transformer layers and uses cased tokenization. It can be loaded in either PyTorch or TensorFlow through the Hugging Face transformers library; a minimal loading sketch follows the list below. Its vocabulary was updated in 2021, with backward compatibility maintained through a specific version tag.
- Supports both PyTorch and TensorFlow implementations
- Uses cased tokenization for better proper noun handling
- Updated vocabulary with backward compatibility options
- Demonstrates superior performance in Dutch language tasks
Core Capabilities
- Named Entity Recognition (NER) with 90.24% accuracy on CoNLL-2002
- Part-of-speech tagging with 96.48% accuracy on UDv2.5 LassySmall
- Outperforms multilingual BERT (mBERT) and other Dutch models on most benchmarks
- Specialized in Dutch language understanding and processing (see the fill-mask probe sketched below)
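Because BERTje is pretrained with a masked-language-modelling objective, a quick way to probe its Dutch language understanding is the transformers fill-mask pipeline. A minimal sketch, again assuming the `GroNLP/bert-base-dutch-cased` Hub ID; the example sentence is illustrative:

```python
from transformers import pipeline

# Fill-mask probe of the pretrained model (no fine-tuning needed).
fill_mask = pipeline("fill-mask", model="GroNLP/bert-base-dutch-cased")

# "I cycle to the [MASK] every day."
predictions = fill_mask("Ik fiets elke dag naar de [MASK].")
for p in predictions:
    print(f"{p['token_str']!r}: {p['score']:.3f}")
```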
Frequently Asked Questions
Q: What makes this model unique?
BERTje stands out for its specialized focus on Dutch, consistently outperforming multilingual alternatives such as mBERT on Dutch-specific tasks. Its strong results on NER and POS-tagging benchmarks make it a go-to choice for Dutch language processing.
Q: What are the recommended use cases?
The model excels at Dutch language tasks, particularly Named Entity Recognition, Part-of-Speech tagging, and general Dutch text understanding. It is well suited to academic, commercial, and research applications that require deep processing of Dutch text; a token-classification fine-tuning sketch follows below.
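For the NER and POS use cases above, the usual pattern is to fine-tune BERTje with a token-classification head. A minimal sketch, assuming the `GroNLP/bert-base-dutch-cased` Hub ID and a hypothetical CoNLL-2002-style label set (the labels are illustrative, not taken from the model card):

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical CoNLL-2002-style NER label set (IOB2 scheme).
labels = ["O",
          "B-PER", "I-PER",
          "B-ORG", "I-ORG",
          "B-LOC", "I-LOC",
          "B-MISC", "I-MISC"]

tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "GroNLP/bert-base-dutch-cased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# From here, train with the standard transformers
# token-classification recipe (e.g. the Trainer API).
```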