Custom Legal-BERT
| Property | Value |
|---|---|
| Author | casehold |
| Training Data | Harvard Law case corpus (37GB) |
| Vocabulary Size | 32,000 tokens |
| Paper | arXiv:2104.08671 |
What is custom-legalbert?
Custom Legal-BERT is a specialized language model pretrained for legal-domain tasks. Built on the BERT architecture, the model was trained from scratch on a corpus of 3.4 million legal decisions from the Harvard Law case repository, spanning 1965 to the present. The training corpus (37GB) is significantly larger than the original BERT's BookCorpus/Wikipedia dataset (15GB), providing broad coverage of legal terminology and concepts.
Implementation Details
The model implements several domain-specific optimizations (a loading sketch follows this list):
- Custom tokenization adapted specifically for legal text
- Domain-specific legal vocabulary of 32,000 tokens created using SentencePiece
- Trained for 2 million steps using Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives
- Specialized sentence segmentation designed for legal documents
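A minimal sketch of loading the checkpoint and inspecting its legal-domain tokenizer with the `transformers` library. It assumes the model is published on the Hugging Face Hub under the id `casehold/custom-legalbert`; adjust the id if the published name differs.

```python
from transformers import AutoTokenizer, AutoModel

# Assumed Hub id; replace if the checkpoint lives elsewhere.
model_id = "casehold/custom-legalbert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# The custom 32,000-token vocabulary should keep common legal terms intact
# rather than splitting them into generic sub-word pieces.
print(tokenizer.vocab_size)  # expected: 32000
print(tokenizer.tokenize("The appellee filed a writ of certiorari."))
```

Comparing the tokenization above against a general-purpose BERT tokenizer is a quick way to see the effect of the domain-specific vocabulary.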
Core Capabilities
- Legal text classification tasks
- Multiple choice legal reasoning (CaseHOLD dataset); see the scoring sketch after this list
- Legal document analysis
- Case law understanding and processing
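The sketch below illustrates CaseHOLD-style multiple-choice scoring: a citing prompt is paired with several candidate holdings and scored with a multiple-choice head. The model id, example strings, and label choices are illustrative assumptions, and a real run needs a head fine-tuned on CaseHOLD before the scores are meaningful.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

model_id = "casehold/custom-legalbert"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMultipleChoice.from_pretrained(model_id)

prompt = "The court held that <HOLDING>, reversing the lower court."
holdings = [  # hypothetical candidate holdings
    "holding that the statute of limitations was tolled",
    "holding that the contract was unenforceable",
    "holding that the evidence was inadmissible",
    "holding that summary judgment was improper",
    "holding that the appeal was untimely",
]

# Pair the prompt with each candidate and stack into a batch of shape
# (1, num_choices, seq_len), as the multiple-choice head expects.
enc = tokenizer([prompt] * len(holdings), holdings,
                padding=True, truncation=True, return_tensors="pt")
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_choices)
print(logits.softmax(dim=-1))
```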
Frequently Asked Questions
Q: What makes this model unique?
The model's distinguishing feature is its specialized legal-domain pretraining, which used one of the largest legal text corpora ever employed for model pretraining. The custom vocabulary and tokenization are optimized for legal text, making the model particularly effective for legal NLP tasks.
Q: What are the recommended use cases?
The model is particularly suited for tasks like case law analysis, legal document classification, and multiple choice reasoning about legal holdings (CaseHOLD). It's designed for applications requiring deep understanding of legal terminology and concepts.
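For document classification, a common pattern is to attach a sequence-classification head and fine-tune on labeled legal text. The sketch below is illustrative only; the model id is assumed as above, and the label set is a hypothetical placeholder.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "casehold/custom-legalbert"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=3)  # e.g. contract / statute / opinion (hypothetical labels)

batch = tokenizer(["This indenture, made the first day of June..."],
                  truncation=True, padding=True, return_tensors="pt")
outputs = model(**batch)
print(outputs.logits.shape)  # (1, num_labels); fine-tune before relying on the scores
```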