roberta-base-bne

PlanTL-GOB-ES

Spanish RoBERTa model trained on 570GB of web-crawl text from the National Library of Spain (BNE). Excels at masked language modeling and, with fine-tuning, at downstream NLP tasks such as text classification and NER.

Architecture: RoBERTa Base
Language: Spanish
Training Data: 570GB BNE Corpus
License: Apache 2.0
Developer: PlanTL-GOB-ES

What is roberta-base-bne?

roberta-base-bne is a Spanish language model based on the RoBERTa architecture, trained on 570GB of clean text from the National Library of Spain's (Biblioteca Nacional de España, BNE) web crawls. Built from data collected between 2009 and 2019, it provides a robust pretrained foundation for a wide range of Spanish NLP tasks.

Implementation Details

The model was trained using a byte-level version of BPE tokenization with a 50,262 token vocabulary. Training was conducted over 48 hours using 16 computing nodes, each equipped with 4 NVIDIA V100 GPUs. The training corpus underwent extensive preprocessing, including sentence splitting, language detection, and deduplication, resulting in 201,080,084 documents with over 135 billion tokens.

  • Masked language modeling architecture based on RoBERTa
  • Extensive preprocessing pipeline for high-quality training data
  • Optimized for Spanish language understanding
  • State-of-the-art performance on multiple downstream tasks

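Assuming the model is published on the Hugging Face Hub under the developer's name (the id PlanTL-GOB-ES/roberta-base-bne is inferred from the developer listed above, not stated in this card), the pre-training objective can be exercised directly with the transformers fill-mask pipeline:

```python
from transformers import pipeline

# Hub id assumed from the developer name above; adjust if hosted elsewhere.
MODEL_ID = "PlanTL-GOB-ES/roberta-base-bne"

fill_mask = pipeline("fill-mask", model=MODEL_ID)

# Use the tokenizer's own mask token rather than hard-coding "<mask>".
sentence = f"Madrid es la capital de {fill_mask.tokenizer.mask_token}."
predictions = fill_mask(sentence)

for p in predictions:
    print(f"{p['token_str']!r}  score={p['score']:.4f}")
```

Each prediction is a dict containing the candidate token and its probability; the sentence above is only an illustrative prompt.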
Core Capabilities

  • Fill-mask task performance (primary pre-training objective)
  • Strong results in text classification (MLDoc F1: 0.9664)
  • Named Entity Recognition (CAPITEL-NERC F1: 0.8960)
  • Question Answering (SQAC F1: 0.7923)
  • Natural Language Inference (XNLI Accuracy: 0.8016)
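Beyond the fill-mask head, the base encoder can also supply contextual embeddings for custom downstream pipelines. A minimal sketch, assuming the same (hypothetical) Hub id and using mean pooling over the last hidden state:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "PlanTL-GOB-ES/roberta-base-bne"  # assumed Hub id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

inputs = tokenizer(
    "El modelo fue entrenado con textos de la BNE.", return_tensors="pt"
)
with torch.no_grad():
    outputs = model(**inputs)

# A base-size RoBERTa encoder has hidden size 768; mean-pool the token
# vectors of the last layer into a single sentence vector.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)
```

Mean pooling is just one simple choice; task-specific heads or the [CLS]-position vector are common alternatives.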

Frequently Asked Questions

Q: What makes this model unique?

The model's uniqueness lies in its training on the largest Spanish corpus to date, compiled from the National Library of Spain's web crawls. This extensive dataset, combined with careful preprocessing and state-of-the-art architecture, makes it particularly effective for Spanish language tasks.

Q: What are the recommended use cases?

The model excels in masked language modeling tasks and can be fine-tuned for various downstream applications including question answering, text classification, and named entity recognition. It's particularly suitable for tasks requiring deep understanding of Spanish language context.
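The fine-tuning path described above can be sketched as follows, assuming the same hypothetical Hub id, a binary classification task, and placeholder hyperparameters (none of these are prescribed by the model card):

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
)

MODEL_ID = "PlanTL-GOB-ES/roberta-base-bne"  # assumed Hub id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Replaces the MLM head with a freshly initialised classification head.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

args = TrainingArguments(
    output_dir="roberta-bne-cls",       # illustrative path
    per_device_train_batch_size=16,     # placeholder hyperparameters
    num_train_epochs=3,
    learning_rate=2e-5,
)

# With a tokenised dataset in hand, training reduces to:
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```

The same pattern applies to the other downstream tasks listed above by swapping the head class (e.g. AutoModelForTokenClassification for NER, AutoModelForQuestionAnswering for QA).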
