roberta-base-bne

Maintained By
PlanTL-GOB-ES

Property         Value
Architecture     RoBERTa Base
Language         Spanish
Training Data    570GB BNE Corpus
License          Apache 2.0
Developer        PlanTL-GOB-ES

What is roberta-base-bne?

roberta-base-bne is a Spanish language model based on the RoBERTa base architecture, trained on 570GB of clean, deduplicated text from the web crawls of the National Library of Spain (Biblioteca Nacional de España, BNE). Drawing on data collected between 2009 and 2019, it provides a robust foundation for a wide range of Spanish NLP tasks.

Implementation Details

The model was trained with byte-level Byte-Pair Encoding (BPE) tokenization, the same scheme used by the original RoBERTa, with a vocabulary of 50,262 tokens. Training ran for 48 hours on 16 computing nodes, each equipped with 4 NVIDIA V100 GPUs. The training corpus underwent extensive preprocessing, including sentence splitting, language detection, and deduplication, resulting in 201,080,084 documents with over 135 billion tokens.

  • Masked language modeling architecture based on RoBERTa
  • Extensive preprocessing pipeline for high-quality training data
  • Optimized for Spanish language understanding
  • State-of-the-art performance on multiple downstream tasks
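As a minimal sketch of what this tokenization looks like in practice, the snippet below loads the tokenizer from the Hugging Face Hub; the Hub ID PlanTL-GOB-ES/roberta-base-bne and the example sentence are illustrative assumptions rather than details from the card.

```python
# Sketch: inspecting the byte-level BPE tokenizer, assuming the checkpoint is
# published on the Hugging Face Hub as "PlanTL-GOB-ES/roberta-base-bne".
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-bne")
print(tokenizer.vocab_size)  # expected: 50262 subword tokens

text = "La Biblioteca Nacional de España custodia el patrimonio bibliográfico."
print(tokenizer.tokenize(text))  # byte-level BPE pieces; "Ġ" marks a word boundary
print(tokenizer.encode(text))    # adds the <s> ... </s> special tokens
```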

Core Capabilities

  • Fill-mask prediction, the primary pre-training objective (see the sketch after this list)
  • Strong results in text classification (MLDoc F1: 0.9664)
  • Named Entity Recognition (CAPITEL-NERC F1: 0.8960)
  • Question Answering (SQAC F1: 0.7923)
  • Natural Language Inference (XNLI Accuracy: 0.8016)
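The fill-mask objective listed above maps directly onto the transformers fill-mask pipeline. A minimal sketch, again assuming the Hub ID PlanTL-GOB-ES/roberta-base-bne and an illustrative input sentence:

```python
# Sketch: masked-token prediction with the fill-mask pipeline.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="PlanTL-GOB-ES/roberta-base-bne")

# RoBERTa models use "<mask>" as the mask token.
for prediction in unmasker("Madrid es la <mask> de España."):
    print(f"{prediction['token_str']!r}  score={prediction['score']:.3f}")
```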

Frequently Asked Questions

Q: What makes this model unique?

The model's uniqueness lies in its training on the largest Spanish corpus to date, compiled from the National Library of Spain's web crawls. This extensive dataset, combined with careful preprocessing and state-of-the-art architecture, makes it particularly effective for Spanish language tasks.

Q: What are the recommended use cases?

The model excels in masked language modeling tasks and can be fine-tuned for various downstream applications including question answering, text classification, and named entity recognition. It's particularly suitable for tasks requiring deep understanding of Spanish language context.
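For the fine-tuning route, a hedged sketch using the Trainer API is shown below; the toy dataset, label count, and hyperparameters are placeholders for illustration, not values from the model card.

```python
# Sketch: fine-tuning for binary text classification. Dataset, labels, and
# hyperparameters are illustrative placeholders, not from the model card.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "PlanTL-GOB-ES/roberta-base-bne"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Tiny in-memory dataset; replace with a real labelled Spanish corpus.
train_data = Dataset.from_dict({
    "text": ["La película fue excelente.", "El servicio fue pésimo."],
    "label": [1, 0],
}).map(lambda batch: tokenizer(batch["text"], truncation=True,
                               padding="max_length", max_length=64),
       batched=True)

args = TrainingArguments(output_dir="roberta-base-bne-clf",
                         per_device_train_batch_size=2,
                         num_train_epochs=1,
                         learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_data).train()
```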
