legal-croatian-roberta-base
Property | Value |
---|---|
Parameter Count | 111M |
License | CC BY-SA |
Paper | MultiLegalPile Paper |
Language | Croatian |
Author | Joel Niklaus |
What is legal-croatian-roberta-base?
legal-croatian-roberta-base is a language model specialized for Croatian legal text processing. Built on the RoBERTa architecture and initialized from XLM-R, the model was pretrained on the Croatian portion of the MultiLegalPile dataset, making it particularly effective for legal-domain tasks.
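As a masked language model, it can be queried directly through the Hugging Face fill-mask pipeline. A minimal sketch, assuming the hypothetical repository ID joelniklaus/legal-croatian-roberta-base (substitute the model's actual ID) and using an illustrative Croatian sentence, "The court rendered a <mask> in favor of the plaintiff":

```python
from transformers import pipeline

# Hypothetical repository ID -- replace with the model's actual Hugging Face ID.
fill_mask = pipeline("fill-mask", model="joelniklaus/legal-croatian-roberta-base")

# RoBERTa-style models use "<mask>" as the mask token.
# Croatian: "The court rendered a <mask> in favor of the plaintiff."
for prediction in fill_mask("Sud je donio <mask> u korist tužitelja."):
    print(f"{prediction['token_str']:>15}  {prediction['score']:.3f}")
```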
Implementation Details
The training approach includes warm-starting from XLM-R checkpoints and a custom 128K BPE tokenizer optimized for legal terminology. Training was conducted on Google TPU v3-8 hardware with a batch size of 512 samples for 1M training steps.
- Custom tokenizer with 128K BPEs for better coverage of legal language
- Warm-up phase focused on updating the embeddings
- Increased masking rate of 20% during training (reproduced in the sketch after this list)
- Exponential smoothing for balanced sentence sampling
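To mirror these settings in your own continued pretraining, the sketch below loads the tokenizer and configures the standard Hugging Face data collator with the 20% masking rate. The repository ID is a hypothetical placeholder, and DataCollatorForLanguageModeling is a stand-in for the authors' actual training code:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Hypothetical repository ID -- replace with the model's actual Hugging Face ID.
model_id = "joelniklaus/legal-croatian-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The custom legal tokenizer should report roughly 128K BPEs.
print(tokenizer.vocab_size)
# Croatian: "Supreme Court of the Republic of Croatia"
print(tokenizer.tokenize("Vrhovni sud Republike Hrvatske"))

# Masked language modeling with the increased 20% masking rate (vs. the common 15%).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.20,
)
```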
Core Capabilities
- Masked language modeling for legal text understanding
- Fine-tuning support for sequence classification (a minimal sketch follows this list)
- Token classification capabilities
- Question answering tasks in legal domain
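A minimal fine-tuning entry point for sequence classification, assuming the same hypothetical repository ID as above; the classification head is freshly initialized and must be fine-tuned on labeled legal data before its outputs are meaningful:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "joelniklaus/legal-croatian-roberta-base"  # hypothetical repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Attach an untrained two-label classification head on top of the encoder.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Croatian: "The contract was terminated due to a breach of obligations."
inputs = tokenizer("Ugovor je raskinut zbog povrede obveza.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities (meaningless until fine-tuned)
```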
Frequently Asked Questions
Q: What makes this model unique?
This model is optimized specifically for Croatian legal text: it was trained on a large legal corpus with domain-specific tokenization (128K legal BPEs) and adapted training procedures such as the increased masking rate. This specialization makes it particularly effective for legal NLP tasks in Croatian.
Q: What are the recommended use cases?
The model is best suited for tasks that require understanding legal text in context, including sequence classification, token classification, and question answering. It is not recommended for text generation, where autoregressive models such as GPT-2 are more appropriate.
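As an illustration of the token classification use case, the sketch below attaches a token-classification head for legal entity tagging; the repository ID and the label set are illustrative assumptions, not artifacts shipped with the model:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "joelniklaus/legal-croatian-roberta-base"  # hypothetical repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Illustrative label set for legal named-entity tagging (not part of the model).
labels = ["O", "B-COURT", "I-COURT", "B-LAW", "I-LAW"]
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=len(labels))

# Croatian: "The Supreme Court of the Republic of Croatia applies the Labour Act."
inputs = tokenizer("Vrhovni sud Republike Hrvatske primjenjuje Zakon o radu.", return_tensors="pt")
with torch.no_grad():
    pred_ids = model(**inputs).logits.argmax(dim=-1)[0]
print([labels[i] for i in pred_ids])  # per-token tags (random until fine-tuned)
```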