BETO: Spanish BERT (Uncased)
| Property | Value |
|---|---|
| Developer | dccuchile |
| License | CC BY 4.0 (with disclaimers) |
| Downloads | 309,301 |
| Vocabulary Size | 31k BPE subwords |
| Training Steps | 2M |
What is bert-base-spanish-wwm-uncased?
BETO is a Spanish BERT model trained on a large Spanish corpus using the Whole Word Masking technique. This uncased version is comparable in size to BERT-Base while being optimized specifically for Spanish language tasks.
Implementation Details
The model uses a vocabulary of approximately 31,000 BPE subwords built with SentencePiece and was trained for 2 million steps. Checkpoints are provided for both PyTorch and TensorFlow, so it can be used in either framework (see the loading sketch after the list below).
- Trained with the Whole Word Masking technique for better contextual understanding
- Achieves state-of-the-art performance on multiple Spanish NLP benchmarks
- Compatible with both PyTorch and TensorFlow frameworks
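As a minimal loading sketch, the checkpoint can be pulled through the Hugging Face Transformers library. The repository id `dccuchile/bert-base-spanish-wwm-uncased` and the example sentence are assumptions for illustration, not taken from the model card itself:

```python
# Minimal loading sketch (assumes the Hugging Face Transformers library
# and the repo id dccuchile/bert-base-spanish-wwm-uncased).
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "dccuchile/bert-base-spanish-wwm-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# The uncased tokenizer lowercases input text before subword segmentation.
tokens = tokenizer.tokenize("El modelo entiende español.")
print(tokens)
```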
Core Capabilities
- Part-of-Speech Tagging (POS): 98.44% accuracy
- Named Entity Recognition (NER): 82.67% accuracy
- Document Classification (MLDoc): 96.12% accuracy
- Natural Language Inference (XNLI): 80.15% accuracy
- Paraphrase Identification (PAWS-X): 89.55% accuracy
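The downstream scores above come from fine-tuned variants, but the pretrained masked-language-modelling head can be exercised directly. The sketch below uses the Transformers `fill-mask` pipeline; the example sentence is illustrative and not drawn from any of the benchmarks:

```python
from transformers import pipeline

# Fill-mask sketch using the pretrained MLM head; the sentence is a
# made-up example, not part of the benchmark suites above.
fill_mask = pipeline("fill-mask", model="dccuchile/bert-base-spanish-wwm-uncased")

for pred in fill_mask("todos los caminos llevan a [MASK]."):
    print(f"{pred['token_str']!r}: {pred['score']:.3f}")
```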
Frequently Asked Questions
Q: What makes this model unique?
BETO stands out for its specialized training on Spanish language data using Whole Word Masking, consistently outperforming multilingual BERT models on Spanish-specific tasks. It's particularly notable for achieving state-of-the-art results in MLDoc classification (96.12%) and competitive performance across other benchmarks.
Q: What are the recommended use cases?
The model is well suited to Spanish language processing tasks including text classification, named entity recognition, part-of-speech tagging, and natural language inference. It fits both academic research and production applications that need strong Spanish language understanding; a minimal fine-tuning sketch follows.
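As one concrete use case, the sketch below fine-tunes the checkpoint for binary text classification with the Transformers Trainer API. The toy corpus, label set, and hyperparameters are assumptions for illustration; a real application would substitute its own labelled data:

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "dccuchile/bert-base-spanish-wwm-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Toy corpus standing in for a real labelled Spanish dataset.
texts = ["me encantó la película", "el servicio fue terrible"]
labels = [1, 0]
encodings = tokenizer(texts, truncation=True, padding=True)

class SpanishDataset(torch.utils.data.Dataset):
    """Wraps tokenized texts and labels as a PyTorch dataset."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

# Hyperparameters here are placeholders, not recommendations.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="beto-clf", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=SpanishDataset(encodings, labels),
)
trainer.train()
```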