longformer-base-4096-bne-es
| Property | Value |
|---|---|
| Developer | PlanTL-GOB-ES |
| License | Apache License 2.0 |
| Training Data | 570GB Spanish text from BNE |
| Context Length | 4096 tokens |
| Vocabulary Size | 50,262 tokens |
What is longformer-base-4096-bne-es?
longformer-base-4096-bne-es is a Spanish language model that adapts the roberta-base-bne checkpoint to longer input sequences. It was trained on a corpus from the National Library of Spain (Biblioteca Nacional de España, BNE) and uses the Longformer architecture, which combines sliding-window attention with global attention, to process contexts of up to 4096 tokens.
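To make the attention pattern concrete, here is a minimal pure-Python sketch of which query/key pairs are allowed under sliding-window plus global attention. The sequence length (16), window size (4), and the choice of position 0 as the global token are toy values chosen for illustration, not the model's actual configuration.

```python
def attention_pattern(seq_len, window, global_positions):
    """Return a seq_len x seq_len matrix; True means query i may attend to key j."""
    half = window // 2
    is_global = set(global_positions)
    allowed = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            # Local attention: each token sees neighbours within half a window.
            local = abs(i - j) <= half
            # Global attention: a global token sees everything and is seen by everything.
            allowed[i][j] = local or i in is_global or j in is_global
    return allowed


def count_pairs(allowed):
    """Count the number of allowed query/key pairs."""
    return sum(sum(row) for row in allowed)


# Toy values (assumptions for illustration only).
seq_len, window = 16, 4
pattern = attention_pattern(seq_len, window, global_positions=[0])
sparse_pairs = count_pairs(pattern)
full_pairs = seq_len * seq_len
print(sparse_pairs, full_pairs)  # 100 256
```

With a fixed window, the number of allowed pairs grows roughly linearly with sequence length, whereas full self-attention grows quadratically. That is what makes 4096-token inputs practical.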
Implementation Details
The model was trained on a 7.2GB subset of the larger 570GB corpus, made up of documents shorter than 4096 tokens. Training ran for 40 hours on 8 computing nodes with 2 AMD MI50 GPUs each (16 GPUs in total). Tokenization uses byte-level BPE with a vocabulary of 50,262 tokens.
- Combines sliding window and global attention mechanisms
- Built upon roberta-base-bne checkpoint
- Trained on high-quality Spanish text from BNE crawls (2009-2019)
- Extensive preprocessing including language detection and deduplication
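The preprocessing and length filtering described above could look roughly like the sketch below. This is not the pipeline actually used to build the corpus: `looks_spanish` is a hypothetical stand-in for a real language detector, whitespace splitting stands in for the byte-level BPE tokenizer, and deduplication is limited to exact matches after whitespace normalization.

```python
import hashlib

MAX_TOKENS = 4096  # context length stated for this model


def count_tokens(text):
    # Assumption: whitespace tokens approximate the real byte-level BPE token count.
    return len(text.split())


def looks_spanish(text):
    # Hypothetical toy detector: checks for a few common Spanish function words.
    return bool(set(text.lower().split()) & {"la", "el", "de", "que"})


def filter_corpus(documents, is_spanish=looks_spanish, max_tokens=MAX_TOKENS):
    """Keep Spanish documents shorter than max_tokens, dropping exact duplicates."""
    seen = set()
    kept = []
    for doc in documents:
        normalized = " ".join(doc.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        if not is_spanish(normalized):
            continue  # language filter
        if count_tokens(normalized) >= max_tokens:
            continue  # too long for the 4096-token training split
        kept.append(normalized)
    return kept


docs = [
    "La biblioteca conserva millones de documentos.",
    "La  biblioteca conserva millones de documentos.",  # duplicate after normalization
    "This sentence is in English.",
    "de palabra " * 3000,  # 6000 whitespace tokens, over the limit
]
kept = filter_corpus(docs)
print(kept)
```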
Core Capabilities
- Masked Language Modeling (MLM) for fill-mask tasks
- Question Answering after fine-tuning (F1: 0.8026 on SQAC)
- Named Entity Recognition (F1: 0.8985 on CAPITEL-NERC)
- Text Classification (F1: 0.9608 on MLDoc)
- Natural Language Inference (Accuracy: 0.8210 on XNLI)
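As an illustration of the fill-mask capability, the sketch below uses the Hugging Face `transformers` fill-mask pipeline. The Hub identifier `PlanTL-GOB-ES/longformer-base-4096-bne-es` is an assumption derived from the developer and model name listed above, and `mask_word` and `top_predictions` are hypothetical helper names. The example sentence is illustrative only.

```python
# Assumed Hub id, built from the developer and model name; verify before use.
MODEL_ID = "PlanTL-GOB-ES/longformer-base-4096-bne-es"


def mask_word(sentence, word, mask_token="<mask>"):
    """Replace the first occurrence of `word` with the mask token."""
    if word not in sentence:
        raise ValueError(f"{word!r} not found in sentence")
    return sentence.replace(word, mask_token, 1)


def top_predictions(sentence, word, top_k=5):
    """Return (token, score) pairs predicted for the masked word."""
    # Imported lazily so mask_word works without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=MODEL_ID)
    # Use the tokenizer's own mask token rather than hard-coding it.
    masked = mask_word(sentence, word, fill.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill(masked, top_k=top_k)]


masked_example = mask_word("Madrid es la capital de España.", "capital")
print(masked_example)
```

Calling `top_predictions("Madrid es la capital de España.", "capital")` downloads the model weights on first use and returns the highest-scoring candidate tokens for the masked position.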
Frequently Asked Questions
Q: What makes this model unique?
This model can process longer Spanish texts (up to 4096 tokens) while remaining efficient, because it combines sliding-window attention with global attention instead of attending over every token pair. It is trained specifically on Spanish content from the National Library of Spain.
Q: What are the recommended use cases?
The model can be used directly for masked language modeling and fine-tuned for downstream applications such as question answering, text classification, and named entity recognition. It is particularly suited to applications that analyze longer Spanish documents.
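Documents longer than the context window still need to be split before they reach the model. The sketch below shows one possible approach using overlapping chunks of token ids. The reserve of 2 positions for special tokens and the overlap of 128 tokens are assumptions for illustration, and `split_into_windows` is a hypothetical helper, not part of any library.

```python
def split_into_windows(token_ids, max_len=4096, special_tokens=2, overlap=128):
    """Split token ids into overlapping chunks that fit the context window."""
    budget = max_len - special_tokens  # room left after start/end tokens (assumed 2)
    if overlap >= budget:
        raise ValueError("overlap must be smaller than the per-chunk budget")
    if len(token_ids) <= budget:
        return [list(token_ids)]  # fits in a single pass
    step = budget - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(list(token_ids[start:start + budget]))
        if start + budget >= len(token_ids):
            break  # the last chunk already reaches the end of the document
    return chunks


short_doc = list(range(3000))   # stand-in token ids for a document that fits
long_doc = list(range(10000))   # stand-in token ids for a document that does not
short_chunks = split_into_windows(short_doc)
long_chunks = split_into_windows(long_doc)
print(len(short_chunks), [len(c) for c in long_chunks])
```

The overlap gives each chunk some context from the previous one, at the cost of processing those tokens twice.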