bert-base-romanian-cased-v1
| Property | Value |
|---|---|
| Author | dumitrescustefan |
| Training Data Size | 15.2GB |
| Paper | The birth of Romanian BERT (Findings of EMNLP 2020) |
| Training Corpus Lines | 90.15M |
| Training Corpus Words | 2.42B |
What is bert-base-romanian-cased-v1?
bert-base-romanian-cased-v1 is a BERT base language model trained specifically for Romanian. Built on a 15.2GB corpus comprising OPUS, OSCAR, and Wikipedia data, it marks a significant advancement in Romanian natural language processing, outperforming multilingual BERT across multiple tasks.
Implementation Details
The model was trained on a carefully curated dataset consisting of 55.05M lines from OPUS, 33.56M lines from OSCAR, and 1.54M lines from Wikipedia. It implements the BERT base architecture with case-sensitive tokenization and requires specific text sanitization for optimal performance.
- Requires replacing cedilla diacritics with their comma-below counterparts (ț and ș instead of ţ and ş) before tokenization
- Accessible through HuggingFace's transformers library (see the usage sketch below)
- Demonstrates superior performance compared to multilingual BERT
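A minimal usage sketch in Python, assuming transformers and PyTorch are installed; the input sentence is an arbitrary example, and the replacement chain mirrors the sanitization requirement noted above:

```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

# Sanitize input: the training corpus uses comma-below diacritics (ș, ț),
# so cedilla variants (ş, ţ) should be replaced before tokenization
text = "Aceasta este o propoziţie de test."  # arbitrary example sentence
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")

# Tokenize and run a forward pass
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, 768)
```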
Core Capabilities
- UPOS Tagging: 98.00% accuracy
- XPOS Tagging: 96.46% accuracy
- Named Entity Recognition: 85.88%
- Labeled Attachment Score: 89.69%
- General Romanian language understanding and representation
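As a sketch of the last point, sentence-level representations can be derived by pooling the model's token embeddings. Mean pooling is one common choice here, not something the model card prescribes:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

def sentence_embedding(text: str) -> torch.Tensor:
    """Mean-pool token embeddings into a single 768-dim sentence vector."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)   # zero out padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

emb = sentence_embedding("București este capitala României.")
print(emb.shape)  # torch.Size([1, 768])
```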
Frequently Asked Questions
Q: What makes this model unique?
This model is the first dedicated Romanian BERT model, trained on a comprehensive Romanian corpus of 15.2GB. It consistently outperforms multilingual BERT across all evaluated tasks, making it the go-to choice for Romanian language processing.
Q: What are the recommended use cases?
The model is particularly effective for tasks such as part-of-speech tagging, named entity recognition, and dependency parsing in Romanian text. It's suitable for any NLP task requiring deep understanding of Romanian language structure and semantics.
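To make these use cases concrete, here is a hedged sketch of attaching a token-classification head for fine-tuning. The label set below is hypothetical, and the actual training data (e.g. a Romanian NER or POS corpus) and hyperparameters are up to the user:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical tag set for illustration only; a real setup would use the
# label inventory of the target dataset
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModelForTokenClassification.from_pretrained(
    "dumitrescustefan/bert-base-romanian-cased-v1",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# The classification head is randomly initialized; fine-tune on labeled data
# (e.g. with transformers' Trainer) before expecting useful predictions.
```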