roberta-base-bne

Maintained By
PlanTL-GOB-ES

Property         Value
Architecture     RoBERTa Base
Language         Spanish
Training Data    570GB BNE Corpus
License          Apache 2.0
Developer        PlanTL-GOB-ES

What is roberta-base-bne?

roberta-base-bne is a Spanish language model based on the RoBERTa base architecture, trained on 570GB of clean, deduplicated text from the web crawls of the National Library of Spain (Biblioteca Nacional de España, BNE). Drawing on data collected between 2009 and 2019, it provides a robust foundation for a wide range of Spanish NLP tasks.

Implementation Details

The model was trained with byte-level Byte-Pair Encoding (BPE) tokenization, the same scheme used by the original RoBERTa, with a vocabulary of 50,262 tokens. Training ran for 48 hours on 16 computing nodes, each equipped with 4 NVIDIA V100 GPUs. The training corpus underwent extensive preprocessing, including sentence splitting, language detection, and deduplication, resulting in 201,080,084 documents with over 135 billion tokens.

  • Masked language modeling architecture based on RoBERTa
  • Extensive preprocessing pipeline for high-quality training data
  • Optimized for Spanish language understanding
  • State-of-the-art performance on multiple downstream tasks
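As a minimal sketch of what this tokenization looks like in practice, the snippet below loads the tokenizer from the Hugging Face Hub; the Hub ID PlanTL-GOB-ES/roberta-base-bne and the example sentence are illustrative assumptions rather than details from the card.

```python
# Sketch: inspecting the byte-level BPE tokenizer, assuming the checkpoint is
# published on the Hugging Face Hub as "PlanTL-GOB-ES/roberta-base-bne".
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-bne")
print(tokenizer.vocab_size)  # expected: 50262 subword tokens

text = "La Biblioteca Nacional de España custodia el patrimonio bibliográfico."
print(tokenizer.tokenize(text))  # byte-level BPE pieces; "Ġ" marks a word boundary
print(tokenizer.encode(text))    # adds the <s> ... </s> special tokens
```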

Core Capabilities

  • Fill-mask prediction, the primary pre-training objective (see the sketch after this list)
  • Strong results in text classification (MLDoc F1: 0.9664)
  • Named Entity Recognition (CAPITEL-NERC F1: 0.8960)
  • Question Answering (SQAC F1: 0.7923)
  • Natural Language Inference (XNLI Accuracy: 0.8016)
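The fill-mask objective listed above maps directly onto the transformers fill-mask pipeline. A minimal sketch, again assuming the Hub ID PlanTL-GOB-ES/roberta-base-bne and an illustrative input sentence:

```python
# Sketch: masked-token prediction with the fill-mask pipeline.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="PlanTL-GOB-ES/roberta-base-bne")

# RoBERTa models use "<mask>" as the mask token.
for prediction in unmasker("Madrid es la <mask> de España."):
    print(f"{prediction['token_str']!r}  score={prediction['score']:.3f}")
```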

Frequently Asked Questions

Q: What makes this model unique?

The model's uniqueness lies in its training on the largest Spanish corpus to date, compiled from the National Library of Spain's web crawls. This extensive dataset, combined with careful preprocessing and state-of-the-art architecture, makes it particularly effective for Spanish language tasks.

Q: What are the recommended use cases?

The model excels in masked language modeling tasks and can be fine-tuned for various downstream applications including question answering, text classification, and named entity recognition. It's particularly suitable for tasks requiring deep understanding of Spanish language context.
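For the fine-tuning route, a hedged sketch using the Trainer API is shown below; the toy dataset, label count, and hyperparameters are placeholders for illustration, not values from the model card.

```python
# Sketch: fine-tuning for binary text classification. Dataset, labels, and
# hyperparameters are illustrative placeholders, not from the model card.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "PlanTL-GOB-ES/roberta-base-bne"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Tiny in-memory dataset; replace with a real labelled Spanish corpus.
train_data = Dataset.from_dict({
    "text": ["La película fue excelente.", "El servicio fue pésimo."],
    "label": [1, 0],
}).map(lambda batch: tokenizer(batch["text"], truncation=True,
                               padding="max_length", max_length=64),
       batched=True)

args = TrainingArguments(output_dir="roberta-base-bne-clf",
                         per_device_train_batch_size=2,
                         num_train_epochs=1,
                         learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_data).train()
```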
