roberta-base-ca-v2
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Language | Catalan |
| Training Data Size | ~35GB |
| Architecture | RoBERTa Base |
What is roberta-base-ca-v2?
roberta-base-ca-v2 is a transformer-based masked language model designed specifically for the Catalan language. Developed by the Text Mining Unit at the Barcelona Supercomputing Center, it was trained on a diverse ~35GB corpus that includes Wikipedia, government documents, news articles, and web crawls. The model achieves state-of-the-art results on a range of Catalan language understanding tasks.
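For a quick feel of the pretraining objective, the snippet below queries the model through the transformers fill-mask pipeline. This is a minimal sketch: the Hub ID `projecte-aina/roberta-base-ca-v2` and the example sentence are assumptions for illustration, not taken from this card.

```python
from transformers import pipeline

# Assumed Hugging Face Hub ID; substitute the actual repository name if it differs.
fill_mask = pipeline("fill-mask", model="projecte-aina/roberta-base-ca-v2")

# RoBERTa-style models use "<mask>" as the mask token.
for pred in fill_mask("La capital de Catalunya és <mask>."):
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```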
Implementation Details
The model utilizes the RoBERTa architecture with byte-level BPE tokenization and a vocabulary size of 50,262 tokens. Training was conducted over 96 hours using 16 NVIDIA V100 GPUs, following the original RoBERTa training methodology.
- Trained on 14 different Catalan text sources
- Implements masked language modeling for pre-training
- Achieves competitive results across multiple benchmarks
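The tokenizer properties described above can be checked directly. A minimal sketch, again assuming the `projecte-aina/roberta-base-ca-v2` Hub ID:

```python
from transformers import AutoTokenizer

# Assumed Hub ID, as above.
tokenizer = AutoTokenizer.from_pretrained("projecte-aina/roberta-base-ca-v2")

print(len(tokenizer))                            # expected vocabulary size: 50262
print(tokenizer.tokenize("Bon dia a tothom!"))   # byte-level BPE subword pieces
```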
Core Capabilities
- Named Entity Recognition (89.29% F1 score)
- Part-of-Speech Tagging (98.96% F1 score)
- Text Classification (74.26% accuracy)
- Question Answering (89.50% F1 score on CatalanQA)
- Textual Entailment (83.14% accuracy)
- Semantic Textual Similarity (79.07% combined score)
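As one example of adapting the encoder to a capability listed above, the sketch below attaches a token-classification head for NER. The Hub ID and the BIO label set are illustrative assumptions; a real run would take its labels from the chosen Catalan NER dataset.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Hypothetical BIO label set, for illustration only.
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

model = AutoModelForTokenClassification.from_pretrained(
    "projecte-aina/roberta-base-ca-v2",  # assumed Hub ID
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
tokenizer = AutoTokenizer.from_pretrained("projecte-aina/roberta-base-ca-v2")
# From here, fine-tune with the Trainer API or a custom PyTorch loop.
```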
Frequently Asked Questions
Q: What makes this model unique?
This model is optimized specifically for Catalan: it was trained on a large curated corpus of Catalan text and outperforms multilingual models such as mBERT and XLM-RoBERTa on Catalan-specific tasks.
Q: What are the recommended use cases?
The model is primarily designed for masked language modeling, but it can be fine-tuned for downstream tasks such as question answering, text classification, named entity recognition, and other non-generative NLP tasks in Catalan; a minimal fine-tuning sketch follows.
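The snippet below shows one way such a fine-tuning setup could start for text classification. It is a sketch, not the authors' training recipe: the Hub ID `projecte-aina/roberta-base-ca-v2` and the label count are assumptions for illustration.

```python
from transformers import AutoModelForSequenceClassification

# num_labels=3 is a placeholder; use the label count of the target task.
model = AutoModelForSequenceClassification.from_pretrained(
    "projecte-aina/roberta-base-ca-v2",  # assumed Hub ID
    num_labels=3,
)
# The new classification head is randomly initialized and must be trained
# on labeled Catalan data before the model is usable for inference.
```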