bert-base-romanian-cased-v1

Maintained By
dumitrescustefan

Property | Value
Author | dumitrescustefan
Training Data Size | 15.2GB
Paper | The birth of Romanian BERT (EMNLP 2020)
Training Corpus Lines | 90.15M
Training Corpus Words | 2.42B

What is bert-base-romanian-cased-v1?

bert-base-romanian-cased-v1 is a cased BERT base language model trained specifically on Romanian text. Its 15.2GB training corpus combines OPUS, OSCAR, and Wikipedia data, and the model outperforms multilingual BERT on the Romanian tasks it was evaluated on.

Implementation Details

The training corpus consists of 55.05M lines from OPUS, 33.56M lines from OSCAR, and 1.54M lines from Wikipedia, for 90.15M lines in total. The model implements the BERT base architecture with case-sensitive tokenization, and input text should be sanitized before tokenization for best results.
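As a quick sanity check, the per-source line counts can be summed and compared with the 90.15M total listed in the property table. The sketch below uses only the figures quoted above; the variable names are illustrative.

```python
# Line counts per source, in millions, as quoted in the text above.
CORPUS_LINES_M = {"OPUS": 55.05, "OSCAR": 33.56, "Wikipedia": 1.54}

# Round to two decimals to absorb floating-point noise in the sum.
total = round(sum(CORPUS_LINES_M.values()), 2)

# Print each source's share of the whole corpus.
for source, lines in CORPUS_LINES_M.items():
    print(f"{source}: {lines:.2f}M lines ({lines / total:.1%} of corpus)")
print(f"Total: {total:.2f}M lines")
```

The total matches the "Training Corpus Lines" value, and OPUS contributes the majority of the lines.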

  • Requires the comma-below letters ț and ș in input text; the cedilla variants (ţ, ş) should be replaced before tokenization
  • Accessible through the Hugging Face transformers library
  • Outperforms multilingual BERT on the evaluated Romanian tasks
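The sanitization and loading steps can be sketched as follows. This is a minimal example, not a definitive implementation: it assumes the model is published on the Hugging Face Hub under the ID `dumitrescustefan/bert-base-romanian-cased-v1` (inferred from the author and model name above) and that `transformers` and `torch` are installed. The helper names `sanitize` and `embed` are hypothetical.

```python
# Assumed Hub ID, built from the author and model name given above.
MODEL_ID = "dumitrescustefan/bert-base-romanian-cased-v1"

# Map cedilla letters to their comma-below counterparts. Escapes are used
# because the two glyph families look nearly identical in most fonts.
_CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # ş -> ș
    "\u0163": "\u021b",  # ţ -> ț
    "\u015e": "\u0218",  # Ş -> Ș
    "\u0162": "\u021a",  # Ţ -> Ț
})


def sanitize(text: str) -> str:
    """Replace cedilla s/t with the comma-below forms the model expects."""
    return text.translate(_CEDILLA_TO_COMMA)


def embed(text: str):
    """Return the last hidden states for one sanitized sentence."""
    # Imported lazily so sanitize() stays usable without these libraries.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModel.from_pretrained(MODEL_ID)
    model.eval()

    inputs = tokenizer(sanitize(text), return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Shape: (1, sequence_length, hidden_size)
    return outputs.last_hidden_state
```

Calling `embed("Acesta este un test.")` downloads the weights on first use and returns one vector per token. Applying `sanitize` to every input, including fine-tuning data, keeps the text consistent with the orthography the model expects.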

Core Capabilities

  • UPOS Tagging: 98.00% accuracy
  • XPOS Tagging: 96.46% accuracy
  • Named Entity Recognition: 85.88% accuracy
  • Labeled Attachment Score: 89.69%
  • General Romanian language understanding and representation

Frequently Asked Questions

Q: What makes this model unique?

This model is the first dedicated Romanian BERT model, trained on a 15.2GB Romanian corpus. It outperforms multilingual BERT on all evaluated tasks, which makes it a strong default for Romanian language processing.

Q: What are the recommended use cases?

The model is particularly effective for part-of-speech tagging, named entity recognition, and dependency parsing in Romanian text. It can also serve as a pretrained encoder for other Romanian NLP tasks, which typically require fine-tuning on task-specific data.
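For a tagging use case, one possible starting point is to attach a token classification head to the pretrained encoder. The sketch below is an assumption-laden illustration rather than the authors' evaluation setup: it assumes the Hub ID `dumitrescustefan/bert-base-romanian-cased-v1`, uses the 17 Universal Dependencies UPOS tags as the label set, and the helper name `build_upos_tagger` is hypothetical. The classification head is freshly initialized, so the model must be fine-tuned on labeled data before its predictions mean anything.

```python
# Assumed Hub ID, built from the author and model name given above.
MODEL_ID = "dumitrescustefan/bert-base-romanian-cased-v1"

# The 17 universal part-of-speech tags defined by Universal Dependencies.
UPOS_TAGS = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
]
LABEL2ID = {tag: i for i, tag in enumerate(UPOS_TAGS)}
ID2LABEL = {i: tag for tag, i in LABEL2ID.items()}


def build_upos_tagger():
    """Load the tokenizer and encoder with an untrained UPOS tagging head."""
    # Imported lazily so the label maps are usable without transformers.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForTokenClassification.from_pretrained(
        MODEL_ID,
        num_labels=len(UPOS_TAGS),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    return tokenizer, model
```

The same pattern applies to named entity recognition by swapping in an entity label set; the scores listed under Core Capabilities come from fine-tuned models, not from the pretrained checkpoint alone.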
