roberta-base-ca-v2
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Language | Catalan |
| Training Data Size | ~35GB |
| Architecture | RoBERTa Base |
What is roberta-base-ca-v2?
roberta-base-ca-v2 is a transformer-based masked language model designed specifically for the Catalan language. Developed by the Text Mining Unit at the Barcelona Supercomputing Center, it was trained on a diverse ~35GB corpus that includes Wikipedia, government documents, news articles, and web crawls. The model achieves state-of-the-art results on a range of Catalan language understanding tasks.
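For a quick feel of the pretraining objective, the snippet below queries the model through the transformers fill-mask pipeline. This is a minimal sketch: the Hub ID `projecte-aina/roberta-base-ca-v2` and the example sentence are assumptions for illustration, not taken from this card.

```python
from transformers import pipeline

# Assumed Hugging Face Hub ID; substitute the actual repository name if it differs.
fill_mask = pipeline("fill-mask", model="projecte-aina/roberta-base-ca-v2")

# RoBERTa-style models use "<mask>" as the mask token.
for pred in fill_mask("La capital de Catalunya és <mask>."):
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```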
Implementation Details
The model utilizes the RoBERTa architecture with byte-level BPE tokenization and a vocabulary size of 50,262 tokens. Training was conducted over 96 hours using 16 NVIDIA V100 GPUs, following the original RoBERTa training methodology.
- Trained on 14 different Catalan text sources
- Implements masked language modeling for pre-training
- Achieves competitive results across multiple benchmarks
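The tokenizer properties described above can be checked directly. A minimal sketch, again assuming the `projecte-aina/roberta-base-ca-v2` Hub ID:

```python
from transformers import AutoTokenizer

# Assumed Hub ID, as above.
tokenizer = AutoTokenizer.from_pretrained("projecte-aina/roberta-base-ca-v2")

print(len(tokenizer))                            # expected vocabulary size: 50262
print(tokenizer.tokenize("Bon dia a tothom!"))   # byte-level BPE subword pieces
```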
Core Capabilities
- Named Entity Recognition (89.29% F1 score)
- Part-of-Speech Tagging (98.96% F1 score)
- Text Classification (74.26% accuracy)
- Question Answering (89.50% F1 score on CatalanQA)
- Textual Entailment (83.14% accuracy)
- Semantic Textual Similarity (79.07% combined score)
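As one example of adapting the encoder to a capability listed above, the sketch below attaches a token-classification head for NER. The Hub ID and the BIO label set are illustrative assumptions; a real run would take its labels from the chosen Catalan NER dataset.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Hypothetical BIO label set, for illustration only.
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

model = AutoModelForTokenClassification.from_pretrained(
    "projecte-aina/roberta-base-ca-v2",  # assumed Hub ID
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
tokenizer = AutoTokenizer.from_pretrained("projecte-aina/roberta-base-ca-v2")
# From here, fine-tune with the Trainer API or a custom PyTorch loop.
```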
Frequently Asked Questions
Q: What makes this model unique?
This model is optimized specifically for Catalan: it was trained on a large curated corpus of Catalan text and outperforms multilingual models such as mBERT and XLM-RoBERTa on Catalan-specific tasks.
Q: What are the recommended use cases?
The model is primarily designed for masked language modeling, but it can be fine-tuned for downstream tasks such as question answering, text classification, named entity recognition, and other non-generative NLP tasks in Catalan; a minimal fine-tuning sketch follows.
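The snippet below shows one way such a fine-tuning setup could start for text classification. It is a sketch, not the authors' training recipe: the Hub ID `projecte-aina/roberta-base-ca-v2` and the label count are assumptions for illustration.

```python
from transformers import AutoModelForSequenceClassification

# num_labels=3 is a placeholder; use the label count of the target task.
model = AutoModelForSequenceClassification.from_pretrained(
    "projecte-aina/roberta-base-ca-v2",  # assumed Hub ID
    num_labels=3,
)
# The new classification head is randomly initialized and must be trained
# on labeled Catalan data before the model is usable for inference.
```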