bert-base-romanian-cased-v1
| Property | Value |
|---|---|
| Author | dumitrescustefan |
| Training Data Size | 15.2GB |
| Paper | The birth of Romanian BERT (Findings of EMNLP 2020) |
| Training Corpus Lines | 90.15M |
| Training Corpus Words | 2.42B |
What is bert-base-romanian-cased-v1?
bert-base-romanian-cased-v1 is a BERT base language model trained specifically for Romanian. Built on a 15.2GB corpus comprising OPUS, OSCAR, and Wikipedia data, it marks a significant advancement in Romanian natural language processing, outperforming multilingual BERT across multiple tasks.
Implementation Details
The model was trained on a carefully curated dataset consisting of 55.05M lines from OPUS, 33.56M lines from OSCAR, and 1.54M lines from Wikipedia. It implements the BERT base architecture with case-sensitive tokenization and requires specific text sanitization for optimal performance.
- Requires replacing cedilla diacritics with their comma-below counterparts (ț and ș instead of ţ and ş) before tokenization
- Accessible through HuggingFace's transformers library (see the usage sketch below)
- Demonstrates superior performance compared to multilingual BERT
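A minimal usage sketch in Python, assuming transformers and PyTorch are installed; the input sentence is an arbitrary example, and the replacement chain mirrors the sanitization requirement noted above:

```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

# Sanitize input: the training corpus uses comma-below diacritics (ș, ț),
# so cedilla variants (ş, ţ) should be replaced before tokenization
text = "Aceasta este o propoziţie de test."  # arbitrary example sentence
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")

# Tokenize and run a forward pass
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, 768)
```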
Core Capabilities
- UPOS Tagging: 98.00% accuracy
- XPOS Tagging: 96.46% accuracy
- Named Entity Recognition: 85.88%
- Labeled Attachment Score: 89.69%
- General Romanian language understanding and representation
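As a sketch of the last point, sentence-level representations can be derived by pooling the model's token embeddings. Mean pooling is one common choice here, not something the model card prescribes:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

def sentence_embedding(text: str) -> torch.Tensor:
    """Mean-pool token embeddings into a single 768-dim sentence vector."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)   # zero out padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

emb = sentence_embedding("București este capitala României.")
print(emb.shape)  # torch.Size([1, 768])
```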
Frequently Asked Questions
Q: What makes this model unique?
This model is the first dedicated Romanian BERT model, trained on a comprehensive Romanian corpus of 15.2GB. It consistently outperforms multilingual BERT across all evaluated tasks, making it the go-to choice for Romanian language processing.
Q: What are the recommended use cases?
The model is particularly effective for tasks such as part-of-speech tagging, named entity recognition, and dependency parsing in Romanian text. It's suitable for any NLP task requiring deep understanding of Romanian language structure and semantics.
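To make these use cases concrete, here is a hedged sketch of attaching a token-classification head for fine-tuning. The label set below is hypothetical, and the actual training data (e.g. a Romanian NER or POS corpus) and hyperparameters are up to the user:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical tag set for illustration only; a real setup would use the
# label inventory of the target dataset
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModelForTokenClassification.from_pretrained(
    "dumitrescustefan/bert-base-romanian-cased-v1",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# The classification head is randomly initialized; fine-tune on labeled data
# (e.g. with transformers' Trainer) before expecting useful predictions.
```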