roberta-base-ca-v2

Maintained By
projecte-aina

License: Apache 2.0
Language: Catalan
Training Data Size: ~35GB
Architecture: RoBERTa Base

What is roberta-base-ca-v2?

roberta-base-ca-v2 is a transformer-based masked language model specifically designed for the Catalan language. Developed by the Text Mining Unit at the Barcelona Supercomputing Center, it is trained on a diverse ~35GB Catalan corpus that includes Wikipedia, government documents, news articles, and web crawls. The model achieves state-of-the-art results on a range of Catalan language understanding tasks.
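
Because the model is trained with a masked language modeling objective, it can be queried directly for mask filling. The snippet below is a minimal sketch using the Hugging Face transformers pipeline; it assumes the checkpoint is published on the Hugging Face Hub under the maintainer's namespace as projecte-aina/roberta-base-ca-v2.

    # Minimal sketch: mask filling with the transformers fill-mask pipeline.
    # The model ID below assumes the checkpoint is hosted on the Hugging Face Hub
    # under the projecte-aina namespace shown above.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="projecte-aina/roberta-base-ca-v2")

    # RoBERTa-style models use "<mask>" as the mask token.
    for prediction in fill_mask("La capital de Catalunya és <mask>."):
        print(f"{prediction['token_str']!r}  score={prediction['score']:.3f}")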

Implementation Details

The model utilizes the RoBERTa architecture with byte-level BPE tokenization and a vocabulary size of 50,262 tokens. Training was conducted over 96 hours using 16 NVIDIA V100 GPUs, following the original RoBERTa training methodology.

  • Trained on 14 different Catalan text sources
  • Implements masked language modeling for pre-training
  • Achieves competitive results across multiple benchmarks
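
As a rough illustration of the tokenizer described above, the sketch below loads it with transformers and checks the vocabulary size; the model ID is again an assumption based on the maintainer and model names, not a value taken from the card.

    # Sketch: inspecting the byte-level BPE tokenizer described above.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("projecte-aina/roberta-base-ca-v2")

    print(len(tokenizer))  # vocabulary size (the card reports 50,262 tokens)
    print(tokenizer.tokenize("El model entén textos en català."))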

Core Capabilities

  • Named Entity Recognition (89.29% F1 score)
  • Part-of-Speech Tagging (98.96% F1 score)
  • Text Classification (74.26% accuracy)
  • Question Answering (89.50% F1 score on CatalanQA)
  • Textual Entailment (83.14% accuracy)
  • Semantic Textual Similarity (79.07% combined score)

Frequently Asked Questions

Q: What makes this model unique?

This model is specifically optimized for Catalan language processing. Trained on a comprehensive Catalan-specific corpus, it outperforms multilingual models such as mBERT and XLM-RoBERTa on Catalan-specific tasks.

Q: What are the recommended use cases?

The model is primarily designed for masked language modeling but can be fine-tuned for various downstream tasks including question answering, text classification, named entity recognition, and other non-generative NLP tasks in Catalan.
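
For downstream use, a typical workflow is to fine-tune the checkpoint with the transformers Trainer. The sketch below outlines sequence classification; the dataset name, label count, and hyperparameters are illustrative placeholders rather than values from the model card.

    # Hedged sketch: fine-tuning for Catalan text classification.
    # "my_catalan_dataset", num_labels=2, and the training settings are
    # illustrative assumptions, not details from the model card.
    from datasets import load_dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    model_id = "projecte-aina/roberta-base-ca-v2"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

    # Placeholder for any dataset with "text" and "label" columns.
    dataset = load_dataset("my_catalan_dataset")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    tokenized = dataset.map(tokenize, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="roberta-ca-finetuned", num_train_epochs=3),
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["validation"],
        tokenizer=tokenizer,
    )
    trainer.train()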
