RoBERTalex
| Property | Value |
|---|---|
| Architecture | RoBERTa base |
| Language | Spanish |
| Training Data | 8.9 GB of legal-domain corpora |
| License | Apache 2.0 |
| Paper | arXiv:2110.12201 |
What is RoBERTalex?
RoBERTalex is a Spanish language model based on the RoBERTa base architecture and trained on legal-domain text. It was developed by the Text Mining Unit at the Barcelona Supercomputing Center. The training corpus comprises 8.9 GB of legal texts, which orients the model toward the vocabulary and phrasing of Spanish legal language.
Implementation Details
The model uses byte-level BPE tokenization with a vocabulary of 50,262 tokens. Training ran on 2 computing nodes, each equipped with 4 NVIDIA V100 GPUs with 16 GB of VRAM (8 GPUs in total). The model follows RoBERTa's masked language modeling objective; its evaluation scores are listed under Core Capabilities.
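To make the byte-level part of the tokenization concrete, here is a minimal illustration in plain Python. It does not reproduce the actual RoBERTalex tokenizer or its learned merges; it only shows the general property of byte-level BPE that the base alphabet is the 256 possible byte values, so accented Spanish characters are always representable. The example word is an arbitrary choice.

```python
# Illustration only: byte-level BPE starts from the 256 possible byte values,
# so accented Spanish characters never fall outside the base alphabet.
word = "artículo"
utf8_bytes = list(word.encode("utf-8"))

num_chars = len(word)        # 8 characters
num_bytes = len(utf8_bytes)  # 9 bytes, because "í" takes two bytes in UTF-8

# Every base symbol is a byte value in the range 0..255.
all_in_base_alphabet = all(0 <= b <= 255 for b in utf8_bytes)
```

The learned BPE merges then combine frequent byte sequences into larger tokens, which is how the vocabulary grows well beyond the 256 base symbols.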
- Trained on preprocessed legal corpora with sentence splitting and language detection
- Implements document boundary preservation during training
- Uses RoBERTa base architecture with Spanish legal domain specialization
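The preprocessing bullets above can be pictured with a small sketch. The actual pipeline is not described here, so everything below is an assumption: the regex sentence splitter, the function names, and the convention of a blank line between documents (a layout commonly used for BERT/RoBERTa-style pretraining data) are illustrative stand-ins, and language detection is omitted entirely.

```python
import re
from typing import Iterable, List

# Hypothetical, naive splitter: breaks after ".", "!" or "?" followed by whitespace.
# Real legal text (abbreviations such as "art." or "núm.") needs a proper splitter.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(document: str) -> List[str]:
    """Split one document into sentences, dropping empty fragments."""
    return [s.strip() for s in _SENTENCE_END.split(document) if s.strip()]


def to_training_lines(documents: Iterable[str]) -> List[str]:
    """Emit one sentence per line, with an empty line marking each document boundary."""
    lines: List[str] = []
    for doc in documents:
        sentences = split_sentences(doc)
        if not sentences:
            continue  # skip documents with no usable text
        if lines:
            lines.append("")  # boundary between the previous document and this one
        lines.extend(sentences)
    return lines


docs = [
    "La ley entra en vigor mañana. Deroga la norma anterior.",
    "El tribunal desestima el recurso.",
]
training_lines = to_training_lines(docs)
```

Keeping the boundary marker lets a data loader avoid packing sentences from unrelated documents into the same training sequence.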
Core Capabilities
- Masked language modeling (fill-mask) on Spanish legal text
- 98.71% F1 score on UD-POS tagging
- 83.23% F1 score on CoNLL-NERC
- 73.74% combined score on STS tasks
- Fine-tuning capability for downstream tasks like Question Answering and Text Classification
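For readers unfamiliar with the metric, the F1 figures above are harmonic means of precision and recall. The sketch below shows the computation with made-up counts chosen purely for illustration; they are not taken from the RoBERTalex evaluation, and the exact averaging scheme used for each benchmark is not specified here.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when nothing is correct."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 correct predictions, 20 spurious, 20 missed.
example_f1 = f1_score(true_positives=80, false_positives=20, false_negatives=20)
```

With these counts precision and recall are both 0.8, so the F1 score is also 0.8.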
Frequently Asked Questions
Q: What makes this model unique?
RoBERTalex is trained specifically on Spanish legal text (an 8.9 GB corpus), which suits it to legal-domain applications. It also reports solid scores on standard benchmarks such as UD-POS (98.71% F1) and CoNLL-NERC (83.23% F1).
Q: What are the recommended use cases?
The model can be used directly for masked language modeling and can be fine-tuned for downstream applications such as Question Answering, Text Classification, and Named Entity Recognition in legal contexts. It is best suited to processing and analyzing Spanish legal documents.
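As a rough sketch of the masked language modeling use case, the snippet below uses the Hugging Face `transformers` fill-mask pipeline. Several things here are assumptions rather than facts from this page: the Hub identifier `PlanTL-GOB-ES/RoBERTalex`, the helper names `mask_word` and `predict_masked`, and the example sentence. The model call is left commented out because it downloads the weights on first use.

```python
from typing import List, Tuple

# Assumed Hub identifier (not stated on this page); verify it before use.
MODEL_ID = "PlanTL-GOB-ES/RoBERTalex"


def mask_word(sentence: str, word: str, mask_token: str = "<mask>") -> str:
    """Replace the first occurrence of `word` with the RoBERTa-style mask token."""
    if word not in sentence:
        raise ValueError(f"{word!r} not found in sentence")
    return sentence.replace(word, mask_token, 1)


def predict_masked(sentence: str, top_k: int = 5) -> List[Tuple[str, float]]:
    """Return (token, score) pairs for a sentence containing exactly one mask token."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    results = fill_mask(sentence, top_k=top_k)
    return [(r["token_str"].strip(), r["score"]) for r in results]


masked = mask_word("El contrato fue firmado por ambas partes.", "contrato")
# predictions = predict_masked(masked)  # uncomment to query the model (needs network access)
```

Fine-tuning for the downstream tasks mentioned above would follow the usual `transformers` workflow for RoBERTa-style checkpoints, with a task-specific head placed on top of the pretrained encoder.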