longformer-base-4096-bne-es

Property	Value
Developer	PlanTL-GOB-ES
License	Apache License 2.0
Training Data	570GB Spanish text from BNE
Context Length	4096 tokens
Vocabulary Size	50,262 tokens

What is longformer-base-4096-bne-es?

longformer-base-4096-bne-es is a Spanish language model derived from roberta-base-bne and adapted to handle longer text sequences. It was trained on a corpus from the National Library of Spain (Biblioteca Nacional de España, BNE) and uses the Longformer architecture, which processes contexts of up to 4096 tokens by combining sliding-window (local) attention with global attention.
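The combination of sliding-window and global attention can be illustrated with a small sketch. This is a toy illustration, not the model's actual implementation: the function names, the tiny sequence length, and the window size are assumptions chosen for readability. Each token attends to its neighbours inside the window, and a few designated global tokens attend to, and are attended by, every position.

```python
def longformer_attention_pattern(seq_len, window, global_positions=()):
    """Return a seq_len x seq_len matrix of booleans.

    Entry [i][j] is True when token i may attend to token j.
    `window` is the total local window size (window // 2 tokens per side).
    `global_positions` are hypothetical indices given global attention.
    """
    half = window // 2
    is_global = set(global_positions)
    pattern = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            local = abs(i - j) <= half                    # sliding window
            global_ = i in is_global or j in is_global    # global attention
            row.append(local or global_)
        pattern.append(row)
    return pattern


def count_attended_pairs(pattern):
    """Count the (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in pattern)


# Toy example: 8 tokens, window of 2, global attention on the first token.
pattern = longformer_attention_pattern(seq_len=8, window=2, global_positions=[0])
for row in pattern:
    print("".join("x" if cell else "." for cell in row))
print(count_attended_pairs(pattern), "of", 8 * 8, "pairs attended")
```

With a fixed window, the number of attended pairs grows roughly linearly with sequence length rather than quadratically, which is what makes 4096-token inputs practical.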

Implementation Details

The model was trained on a 7.2GB subset of documents shorter than 4096 tokens, drawn from the larger 570GB corpus. Training ran for 40 hours on 8 computing nodes with 2 AMD MI50 GPUs each (16 GPUs in total). Tokenization uses byte-level BPE with a vocabulary of 50,262 tokens.

Combines sliding window and global attention mechanisms
Built upon roberta-base-bne checkpoint
Trained on high-quality Spanish text from BNE crawls (2009-2019)
Extensive preprocessing including language detection and deduplication
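The selection of documents under 4096 tokens, together with deduplication, could be sketched roughly as follows. This is only an illustration under stated assumptions: the whitespace-based `count_tokens` is a stand-in for the model's real BPE tokenizer, exact-match hashing is just one simple form of deduplication, language detection is omitted, and none of the names come from the original pipeline.

```python
import hashlib

MAX_TOKENS = 4096  # context length stated for the model


def count_tokens(text):
    """Stand-in token counter (assumption): splits on whitespace.

    A real pipeline would count tokens with the model's BPE tokenizer.
    """
    return len(text.split())


def filter_corpus(documents, max_tokens=MAX_TOKENS, token_counter=count_tokens):
    """Drop empty documents, exact duplicates, and documents that are too long."""
    seen = set()
    kept = []
    for doc in documents:
        normalized = " ".join(doc.split())  # collapse runs of whitespace
        if not normalized:
            continue
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        if token_counter(normalized) < max_tokens:
            kept.append(doc)
    return kept


docs = ["hola mundo", "hola   mundo", "palabra " * 5000, "otro texto"]
print(filter_corpus(docs))
```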

Core Capabilities

Masked Language Modeling (MLM) for fill-mask tasks
Fine-tuning for Question Answering (F1: 0.8026 on SQAC)
Fine-tuning for Named Entity Recognition (F1: 0.8985 on CAPITEL-NERC)
Fine-tuning for Text Classification (F1: 0.9608 on MLDoc)
Fine-tuning for Natural Language Inference (Accuracy: 0.8210 on XNLI)
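For the fill-mask capability, a minimal usage sketch with the Hugging Face transformers library might look like the following. The Hub identifier is an assumption built from the developer and model names above, and `top_fill_mask_predictions` is a hypothetical helper. The import is placed inside the function so that nothing is downloaded until the function is called.

```python
# Assumed Hub identifier (developer name + model name); verify before use.
MODEL_ID = "PlanTL-GOB-ES/longformer-base-4096-bne-es"


def top_fill_mask_predictions(text_template, top_k=5, model_id=MODEL_ID):
    """Return (token, score) pairs for the masked position.

    `text_template` must contain a single "{mask}" placeholder, which is
    replaced by the tokenizer's own mask token.
    """
    from transformers import pipeline  # lazy import; requires `transformers`

    fill_mask = pipeline("fill-mask", model=model_id)
    text = text_template.format(mask=fill_mask.tokenizer.mask_token)
    predictions = fill_mask(text, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]
```

A call such as `top_fill_mask_predictions("Madrid es la {mask} de España.")` would download the model on first use and return the highest-scoring candidates for the masked word.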

Frequently Asked Questions

Q: What makes this model unique?

This model can process longer Spanish texts (up to 4096 tokens) while keeping attention costs manageable through its combination of sliding-window and global attention. It is trained specifically on Spanish content from the National Library of Spain.

Q: What are the recommended use cases?

The model can be used directly for masked language modeling and fine-tuned for downstream applications including question answering, text classification, and named entity recognition. It is particularly useful for applications that analyze long Spanish documents.
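When fine-tuning a Longformer-style model on long documents, the tokens that receive global attention are usually chosen per task. The sketch below shows one common convention: the first token for classification, and the question tokens for question answering. The helper name, the `mode` values, and the token ids are hypothetical. In the transformers library, such a 0/1 list would be passed to the model as `global_attention_mask`.

```python
def build_global_attention_mask(input_ids, sep_token_id, mode="classification"):
    """Return a list of 0/1 flags, one per token (1 = global attention).

    classification:      global attention on the first token only.
    question_answering:  global attention on every token before the first
                         separator, assuming the question comes first.
    """
    mask = [0] * len(input_ids)
    if not input_ids:
        return mask
    if mode == "classification":
        mask[0] = 1
    elif mode == "question_answering":
        first_sep = input_ids.index(sep_token_id)
        for position in range(first_sep):
            mask[position] = 1
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return mask


# Hypothetical token ids: 0 = start token, 2 = separator.
ids = [0, 11, 12, 2, 2, 21, 22, 23, 2]
print(build_global_attention_mask(ids, sep_token_id=2))
print(build_global_attention_mask(ids, sep_token_id=2, mode="question_answering"))
```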