legal-croatian-roberta-base
Property | Value |
---|---|
Parameter Count | 111M |
License | CC BY-SA |
Paper | MultiLegalPile Paper |
Language | Croatian |
Author | Joel Niklaus |
What is legal-croatian-roberta-base?
legal-croatian-roberta-base is a language model specialized for Croatian legal text processing. Built on the RoBERTa architecture and initialized from XLM-R, the model was pretrained on the Croatian portion of the MultiLegalPile dataset, making it particularly effective for legal-domain tasks.
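As a masked language model, it can be queried directly through the Hugging Face fill-mask pipeline. A minimal sketch, assuming the hypothetical repository ID joelniklaus/legal-croatian-roberta-base (substitute the model's actual ID) and using an illustrative Croatian sentence, "The court rendered a <mask> in favor of the plaintiff":

```python
from transformers import pipeline

# Hypothetical repository ID -- replace with the model's actual Hugging Face ID.
fill_mask = pipeline("fill-mask", model="joelniklaus/legal-croatian-roberta-base")

# RoBERTa-style models use "<mask>" as the mask token.
# Croatian: "The court rendered a <mask> in favor of the plaintiff."
for prediction in fill_mask("Sud je donio <mask> u korist tužitelja."):
    print(f"{prediction['token_str']:>15}  {prediction['score']:.3f}")
```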
Implementation Details
The training approach includes warm-starting from XLM-R checkpoints and a custom 128K BPE tokenizer optimized for legal terminology. Training was conducted on Google TPU v3-8 hardware with a batch size of 512 samples for 1M training steps.
- Custom tokenizer with 128K BPEs for better coverage of legal language
- Warm-up phase focused on updating the embeddings
- Increased masking rate of 20% during training (reproduced in the sketch after this list)
- Exponential smoothing for balanced sentence sampling
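To mirror these settings in your own continued pretraining, the sketch below loads the tokenizer and configures the standard Hugging Face data collator with the 20% masking rate. The repository ID is a hypothetical placeholder, and DataCollatorForLanguageModeling is a stand-in for the authors' actual training code:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Hypothetical repository ID -- replace with the model's actual Hugging Face ID.
model_id = "joelniklaus/legal-croatian-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The custom legal tokenizer should report roughly 128K BPEs.
print(tokenizer.vocab_size)
# Croatian: "Supreme Court of the Republic of Croatia"
print(tokenizer.tokenize("Vrhovni sud Republike Hrvatske"))

# Masked language modeling with the increased 20% masking rate (vs. the common 15%).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.20,
)
```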
Core Capabilities
- Masked language modeling for legal text understanding
- Fine-tuning support for sequence classification (a minimal sketch follows this list)
- Token classification capabilities
- Question answering tasks in legal domain
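A minimal fine-tuning entry point for sequence classification, assuming the same hypothetical repository ID as above; the classification head is freshly initialized and must be fine-tuned on labeled legal data before its outputs are meaningful:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "joelniklaus/legal-croatian-roberta-base"  # hypothetical repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Attach an untrained two-label classification head on top of the encoder.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Croatian: "The contract was terminated due to a breach of obligations."
inputs = tokenizer("Ugovor je raskinut zbog povrede obveza.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities (meaningless until fine-tuned)
```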
Frequently Asked Questions
Q: What makes this model unique?
This model is optimized specifically for Croatian legal text: it was trained on a large legal corpus with domain-specific tokenization (128K legal BPEs) and adapted training procedures such as the increased masking rate. This specialization makes it particularly effective for legal NLP tasks in Croatian.
Q: What are the recommended use cases?
The model is best suited for tasks that require understanding legal text in context, including sequence classification, token classification, and question answering. It is not recommended for text generation, where autoregressive models such as GPT-2 are more appropriate.
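As an illustration of the token classification use case, the sketch below attaches a token-classification head for legal entity tagging; the repository ID and the label set are illustrative assumptions, not artifacts shipped with the model:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "joelniklaus/legal-croatian-roberta-base"  # hypothetical repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Illustrative label set for legal named-entity tagging (not part of the model).
labels = ["O", "B-COURT", "I-COURT", "B-LAW", "I-LAW"]
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=len(labels))

# Croatian: "The Supreme Court of the Republic of Croatia applies the Labour Act."
inputs = tokenizer("Vrhovni sud Republike Hrvatske primjenjuje Zakon o radu.", return_tensors="pt")
with torch.no_grad():
    pred_ids = model(**inputs).logits.argmax(dim=-1)[0]
print([labels[i] for i in pred_ids])  # per-token tags (random until fine-tuned)
```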