bertin-roberta-base-spanish

bertin-project

Spanish RoBERTa base model trained on perplexity-sampled mC4 data. 125M parameters, achieves SOTA on MLDoc and competitive NER/POS performance.

Parameter Count: 125M
License: CC-BY-4.0
Paper: BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling
Training Data: Sampled Spanish mC4 Dataset

What is bertin-roberta-base-spanish?

BERTIN is a RoBERTa-based language model trained specifically for Spanish text processing. What makes it unique is its innovative training approach using perplexity sampling, which allowed the team to train a competitive model using just one-fifth of the traditional data volume. The model achieves state-of-the-art performance on several Spanish language tasks while being trained with significantly fewer resources.

Implementation Details

The model was trained using Flax/JAX on TPUv3-8 hardware, implementing a novel perplexity sampling technique to select high-quality training data from the Spanish portion of mC4. This approach enabled efficient training with only 50M samples instead of the full 416M available samples.

  • Architecture: RoBERTa base architecture with 125M parameters
  • Training Data: Carefully sampled subset of Spanish mC4 using perplexity-based selection
  • Training Infrastructure: 3 TPUv3-8 units for approximately 10 days
  • Sequence Length: Available in both 128 and 512 token versions
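The selection step described above can be illustrated with a toy sketch. The real pipeline scored mC4 documents with a pretrained language model and favored documents near the middle of the perplexity distribution (the "Gaussian" sampling variant); the unigram scorer and weighting function below are simplified stand-ins for illustration, not the project's actual code:

```python
import math
import random
from collections import Counter

def train_unigram(corpus):
    """Toy add-one-smoothed unigram LM, standing in for a real perplexity scorer."""
    counts = Counter(w for doc in corpus for w in doc.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 for unknown words
    probs = {w: (c + 1) / (total + vocab) for w, c in counts.items()}
    unk = 1 / (total + vocab)
    return probs, unk

def perplexity(probs, unk, text):
    """Per-word perplexity of a document under the unigram model."""
    words = text.split()
    if not words:
        return float("inf")
    log_p = sum(math.log(probs.get(w, unk)) for w in words)
    return math.exp(-log_p / len(words))

def sample_by_perplexity(docs, scores, keep=0.5, seed=0):
    """Keep documents with probability weighted toward the median perplexity,
    a rough analogue of Gaussian perplexity sampling."""
    rng = random.Random(seed)
    mid = sorted(scores)[len(scores) // 2]
    spread = (max(scores) - min(scores)) or 1.0
    kept = []
    for doc, ppl in zip(docs, scores):
        weight = math.exp(-(((ppl - mid) / spread) ** 2))
        if rng.random() < keep * weight:
            kept.append(doc)
    return kept

corpus = ["el gato duerme", "asdf qwerty zxcv", "el perro corre", "el gato corre"]
probs, unk = train_unigram(corpus)
scores = [perplexity(probs, unk, d) for d in corpus]
subset = sample_by_perplexity(corpus, scores, keep=0.9)
```

The key design point is that sampling is biased by perplexity rather than uniform, so a small subset retains more well-formed text than random sampling of the same size.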

Core Capabilities

  • Masked Language Modeling with high accuracy (0.65-0.69)
  • State-of-the-art performance on MLDoc classification
  • Competitive results on NER (F1 0.8792) and POS tagging (F1 0.9662)
  • Efficient fine-tuning for downstream tasks
  • Specialized for Spanish language understanding
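The masked-language-modeling capability listed above can be exercised through the standard Hugging Face `fill-mask` pipeline (the model id `bertin-project/bertin-roberta-base-spanish` is the project's Hub identifier; the example sentence is illustrative):

```python
from transformers import pipeline

# Load BERTIN for masked-token prediction.
# RoBERTa-style tokenizers use "<mask>" as the mask token.
fill_mask = pipeline("fill-mask", model="bertin-project/bertin-roberta-base-spanish")

predictions = fill_mask("Me gusta comer <mask> por la mañana.")
for p in predictions:
    print(p["token_str"], round(p["score"], 3))
```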

Frequently Asked Questions

Q: What makes this model unique?

BERTIN's key innovation is its perplexity sampling approach, which allows it to achieve competitive performance while using only 20% of the traditional training data volume. This makes it particularly valuable for teams with limited computational resources.

Q: What are the recommended use cases?

The model excels at a range of Spanish NLP tasks, including document classification, named entity recognition, part-of-speech tagging, and masked language modeling. It is particularly suitable for applications that require deep Spanish language understanding under resource constraints.
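For downstream tasks such as MLDoc-style document classification, a classification head can be attached with the standard transformers API; this is a minimal sketch (the `num_labels=4` value matches MLDoc's four classes, and the training loop itself is omitted):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "bertin-project/bertin-roberta-base-spanish"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Adds a randomly initialized classification head on top of the pretrained
# encoder; fine-tune it on labeled Spanish documents (e.g. with Trainer).
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=4)

inputs = tokenizer("El banco central subió los tipos de interés.", return_tensors="pt")
logits = model(**inputs).logits  # shape (1, 4); untrained head, so scores are random
```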
