bert-base-romanian-cased-v1

dumitrescustefan

Romanian BERT base model trained on a 15.2GB corpus, scoring 98.00% on UPOS, 96.46% on XPOS, 85.88% on NER, and 89.69% LAS.

Property               Value
Author                 dumitrescustefan
Training Data Size     15.2GB
Paper                  The birth of Romanian BERT (EMNLP 2020)
Training Corpus Lines  90.15M
Training Corpus Words  2.42B

What is bert-base-romanian-cased-v1?

bert-base-romanian-cased-v1 is a BERT-based language model trained specifically for Romanian. It was trained on a 15.2GB corpus drawn from OPUS, OSCAR, and Wikipedia, and it outperforms multilingual BERT on the evaluated Romanian tasks: UPOS and XPOS tagging, named entity recognition, and dependency parsing.

Implementation Details

The training corpus consists of 55.05M lines from OPUS, 33.56M lines from OSCAR, and 1.54M lines from Wikipedia, for 90.15M lines in total. The model uses the BERT base architecture with case-sensitive tokenization, and input text should be sanitized before it is tokenized.
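As a quick consistency check, the per-source line counts can be summed and compared with the 90.15M total listed in the property table. The sketch below uses only the figures quoted on this page; the variable names are illustrative.

```python
# Line counts in hundredths of a million, kept as integers
# so the sum is exact (no floating-point rounding).
LINES = {"OPUS": 5505, "OSCAR": 3356, "Wikipedia": 154}

# Total in hundredths of a million lines.
total = sum(LINES.values())

# Share of each source, as a percentage of all lines.
shares = {name: 100 * count / total for name, count in LINES.items()}

if __name__ == "__main__":
    print(f"total: {total / 100:.2f}M lines")
    for name, pct in shares.items():
        print(f"{name}: {pct:.1f}% of lines")
```

OPUS and OSCAR together account for nearly all of the lines, while Wikipedia contributes a small fraction.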

  • Requires replacing the cedilla letters ş and ţ with their comma-below counterparts ș and ț before tokenization
  • Accessible through Hugging Face's transformers library
  • Outperforms multilingual BERT on the evaluated tasks
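The sanitization and loading steps listed above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' reference code: the helper names `sanitize` and `embed` are hypothetical, the Hub identifier is assumed from the author and model names given on this page, and handling the uppercase letters is an assumed extension of the lowercase rule. It also assumes the `transformers` and `torch` packages are installed.

```python
# Cedilla letters mapped to their comma-below equivalents.
_CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u015e": "\u0218",  # uppercase S (assumed extension)
    "\u0162": "\u021a",  # uppercase T (assumed extension)
})

# Assumed Hub identifier: author name plus model name from this page.
MODEL_ID = "dumitrescustefan/bert-base-romanian-cased-v1"


def sanitize(text: str) -> str:
    """Replace cedilla s/t with the comma-below forms."""
    return text.translate(_CEDILLA_TO_COMMA)


def embed(text: str):
    """Return the last hidden states for `text`.

    Downloads the tokenizer and model on first use.
    """
    # Imported lazily so `sanitize` works without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModel.from_pretrained(MODEL_ID)
    model.eval()

    inputs = tokenizer(sanitize(text), return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Shape: (1, number of tokens, hidden size).
    return outputs.last_hidden_state
```

Calling `sanitize` on every input, both at fine-tuning time and at inference time, keeps the text consistent with the comma-below spelling that the model expects.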

Core Capabilities

  • UPOS Tagging: 98.00% accuracy
  • XPOS Tagging: 96.46% accuracy
  • Named Entity Recognition: 85.88%
  • Dependency Parsing: 89.69% Labeled Attachment Score (LAS)
  • General Romanian language understanding and representation
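Tagging tasks such as UPOS, XPOS, and NER assign one label per word, while BERT operates on subword tokens, so fine-tuning usually needs a label-alignment step. The sketch below shows one common convention (label the first subword of each word, ignore the rest); it is an illustration, not the setup used to obtain the scores above. The helper name `align_labels` is hypothetical, and the input format mirrors what the `word_ids()` method of a fast tokenizer's output returns in `transformers`.

```python
from typing import List, Optional

# Label value that PyTorch's CrossEntropyLoss skips by default.
IGNORE_INDEX = -100


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[int],
) -> List[int]:
    """Spread word-level labels over subword tokens.

    `word_ids` has one entry per subword token: the index of the
    word it belongs to, or None for special tokens such as [CLS]
    and [SEP]. Only the first subword of each word keeps the
    word's label; all other positions are ignored by the loss.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


if __name__ == "__main__":
    # Three words; the second word is split into two subwords.
    print(align_labels([None, 0, 1, 1, 2, None], [5, 7, 9]))
```

The aligned labels can then be passed, together with the tokenized input, to a token-classification head such as the one created by `AutoModelForTokenClassification.from_pretrained` with a task-specific `num_labels`.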

Frequently Asked Questions

Q: What makes this model unique?

This is the first BERT model dedicated to Romanian, trained on a 15.2GB Romanian corpus. It outperforms multilingual BERT on all evaluated tasks (UPOS, XPOS, NER, and LAS), which makes it a strong default for Romanian language processing.

Q: What are the recommended use cases?

The model is particularly effective for part-of-speech tagging, named entity recognition, and dependency parsing of Romanian text. It can also serve as a general-purpose encoder for other NLP tasks that depend on Romanian syntax and semantics.
