Custom Legal-BERT
| Property | Value |
|---|---|
| Author | casehold |
| Training Data | Harvard Law case corpus (37GB) |
| Vocabulary Size | 32,000 tokens |
| Paper | arXiv:2104.08671 |
What is custom-legalbert?
Custom Legal-BERT is a specialized language model pretrained for legal-domain tasks. Built on the BERT architecture, the model was trained from scratch on a corpus of 3.4 million legal decisions from the Harvard Law case repository, spanning 1965 to the present. The training corpus (37GB) is significantly larger than the original BERT's BookCorpus/Wikipedia dataset (15GB), providing broad coverage of legal terminology and concepts.
Implementation Details
The model implements several domain-specific optimizations (a loading sketch follows this list):
- Custom tokenization adapted specifically for legal text
- Domain-specific legal vocabulary of 32,000 tokens created using SentencePiece
- Trained for 2 million steps using Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives
- Specialized sentence segmentation designed for legal documents
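A minimal sketch of loading the checkpoint and inspecting its legal-domain tokenizer with the `transformers` library. It assumes the model is published on the Hugging Face Hub under the id `casehold/custom-legalbert`; adjust the id if the published name differs.

```python
from transformers import AutoTokenizer, AutoModel

# Assumed Hub id; replace if the checkpoint lives elsewhere.
model_id = "casehold/custom-legalbert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# The custom 32,000-token vocabulary should keep common legal terms intact
# rather than splitting them into generic sub-word pieces.
print(tokenizer.vocab_size)  # expected: 32000
print(tokenizer.tokenize("The appellee filed a writ of certiorari."))
```

Comparing the tokenization above against a general-purpose BERT tokenizer is a quick way to see the effect of the domain-specific vocabulary.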
Core Capabilities
- Legal text classification tasks
- Multiple choice legal reasoning (CaseHOLD dataset); see the scoring sketch after this list
- Legal document analysis
- Case law understanding and processing
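The sketch below illustrates CaseHOLD-style multiple-choice scoring: a citing prompt is paired with several candidate holdings and scored with a multiple-choice head. The model id, example strings, and label choices are illustrative assumptions, and a real run needs a head fine-tuned on CaseHOLD before the scores are meaningful.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

model_id = "casehold/custom-legalbert"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMultipleChoice.from_pretrained(model_id)

prompt = "The court held that <HOLDING>, reversing the lower court."
holdings = [  # hypothetical candidate holdings
    "holding that the statute of limitations was tolled",
    "holding that the contract was unenforceable",
    "holding that the evidence was inadmissible",
    "holding that summary judgment was improper",
    "holding that the appeal was untimely",
]

# Pair the prompt with each candidate and stack into a batch of shape
# (1, num_choices, seq_len), as the multiple-choice head expects.
enc = tokenizer([prompt] * len(holdings), holdings,
                padding=True, truncation=True, return_tensors="pt")
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_choices)
print(logits.softmax(dim=-1))
```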
Frequently Asked Questions
Q: What makes this model unique?
The model's distinguishing feature is its specialized legal-domain pretraining, which used one of the largest legal text corpora ever employed for model pretraining. The custom vocabulary and tokenization are optimized for legal text, making the model particularly effective for legal NLP tasks.
Q: What are the recommended use cases?
The model is particularly suited for tasks like case law analysis, legal document classification, and multiple choice reasoning about legal holdings (CaseHOLD). It's designed for applications requiring deep understanding of legal terminology and concepts.
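For document classification, a common pattern is to attach a sequence-classification head and fine-tune on labeled legal text. The sketch below is illustrative only; the model id is assumed as above, and the label set is a hypothetical placeholder.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "casehold/custom-legalbert"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=3)  # e.g. contract / statute / opinion (hypothetical labels)

batch = tokenizer(["This indenture, made the first day of June..."],
                  truncation=True, padding=True, return_tensors="pt")
outputs = model(**batch)
print(outputs.logits.shape)  # (1, num_labels); fine-tune before relying on the scores
```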