Legal BERTimbau: Portuguese Legal Language Model
| Property | Value |
|---|---|
| Parameter Count | 334M |
| License | MIT |
| Language | Portuguese |
| Framework | PyTorch, Transformers |
| Primary Task | Sentence Similarity |
What is bert-large-portuguese-cased-legal-mlm-nli-sts-v1?
This is a BERT model specialized for Portuguese legal text analysis. Built on the BERTimbau large architecture, it was further trained on legal documents and optimized for semantic similarity tasks. The model maps sentences and paragraphs to a 1024-dimensional dense vector space, making it particularly effective for clustering and semantic search in legal contexts.
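A minimal usage sketch with the sentence-transformers library. The repo id shown is the bare model name from this card; the publishing organization prefix is not stated here, so prefix it as needed. The import is deferred inside the function so the heavy dependency (and model download) is only triggered when the function is called:

```python
def embed_sentences(sentences, model_name="bert-large-portuguese-cased-legal-mlm-nli-sts-v1"):
    """Encode sentences into 1024-dimensional dense vectors."""
    # Deferred import: sentence-transformers and the large model weights
    # are only loaded when this function is actually called.
    from sentence_transformers import SentenceTransformer  # pip install sentence-transformers
    # Note: prepend the publishing organization to model_name (org prefix is
    # not given in this card, so the bare name here is an assumption).
    model = SentenceTransformer(model_name)
    return model.encode(sentences)  # shape: (len(sentences), 1024)
```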
Implementation Details
The model underwent a three-stage training process: MLM training on 30,000 legal documents for 15,000 training steps, followed by NLI training, then fine-tuning for Semantic Textual Similarity on multiple datasets, including ASSIN, ASSIN2, and stsb_multi_mt.
- Masked language model (MLM) training with a 1e-5 learning rate
- NLI training with a batch size of 16 and a 2e-5 learning rate
- STS fine-tuning with specialized legal datasets
- 1024-dimensional output embeddings
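The hyperparameters above can be collected into a single configuration for reference. A sketch summarizing the values from this card; the dataset identifiers follow Hugging Face naming conventions, and exact splits/versions are assumptions:

```python
# Training configuration summarized from the model card.
# Dataset names follow Hugging Face conventions; exact splits are assumptions.
TRAINING_CONFIG = {
    "mlm": {"documents": 30_000, "steps": 15_000, "learning_rate": 1e-5},
    "nli": {"batch_size": 16, "learning_rate": 2e-5},
    "sts": {"datasets": ["assin", "assin2", "stsb_multi_mt"]},
    "embedding_dim": 1024,
}
```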
Core Capabilities
- Semantic similarity computation for legal texts
- Dense vector representation of legal documents
- Usable via both the sentence-transformers library and the Hugging Face Transformers API
- High performance on Portuguese legal document analysis
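When using the raw Hugging Face Transformers API rather than sentence-transformers, a sentence embedding is typically derived by mean pooling the per-token embeddings under the attention mask. A minimal NumPy sketch of that arithmetic (shapes are illustrative; this is the standard pooling step, not code from the model repo):

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padding positions.

    token_embeddings: (seq_len, dim) array of per-token vectors.
    attention_mask:   (seq_len,) array of 1s (real tokens) and 0s (padding).
    """
    mask = attention_mask[:, None].astype(np.float64)  # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)     # (dim,)
    count = np.clip(mask.sum(), 1e-9, None)            # avoid divide-by-zero
    return summed / count
```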
Frequently Asked Questions
Q: What makes this model unique?
This model is specifically optimized for Portuguese legal text analysis, combining the BERT architecture with domain-specific training on legal documents. It achieves strong correlation scores on benchmark datasets, with Pearson correlations ranging from 0.77 to 0.83.
Q: What are the recommended use cases?
The model is ideal for legal document analysis tasks including semantic search in legal databases, document similarity comparison, and legal text clustering. It's particularly well-suited for applications in Portuguese legal institutions and research.
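For semantic search over a legal database, documents are ranked by cosine similarity between the query embedding and precomputed document embeddings. A minimal sketch with toy 2-D vectors standing in for the model's 1024-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_documents(query_vec: np.ndarray, doc_vecs: list) -> list:
    """Return document indices sorted from most to least similar to the query."""
    sims = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
```

In practice the vectors would come from the model's `encode` output, with document embeddings computed once and cached.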