Custom Legal-BERT

Author: casehold
Training Data: Harvard Law case corpus (37GB)
Vocabulary Size: 32,000 tokens
Paper: arXiv:2104.08671

What is custom-legalbert?

Custom Legal-BERT is a specialized language model pretrained specifically for legal-domain tasks. Built on the BERT architecture, it was trained from scratch on a corpus of 3.4 million legal decisions from the Harvard Law case repository, spanning 1965 to the present. At 37GB, the training corpus is significantly larger than the BookCorpus/Wikipedia dataset (15GB) used to pretrain the original BERT, providing comprehensive coverage of legal terminology and concepts.

Implementation Details

The model implements several domain-specific optimizations:

  • Custom tokenization adapted specifically for legal text
  • Domain-specific legal vocabulary of 32,000 tokens created using SentencePiece
  • Trained for 2 million steps on the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives (see the fill-mask sketch after this list)
  • Specialized sentence segmentation designed for legal documents
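
The checkpoint can be loaded with the Hugging Face transformers library. Below is a minimal sketch of probing the MLM head on legal text; it assumes the model is hosted on the Hub as casehold/custom-legalbert (the author and model name from this page), and the example sentence is illustrative:

```python
# A minimal sketch: load the checkpoint and probe its MLM head on
# legal text. The Hub id below is assumed from the author/model name
# on this page; point at a local path if your copy lives elsewhere.
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "casehold/custom-legalbert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Domain terms should rank highly thanks to the legal vocabulary.
for pred in fill("The court granted the motion for summary [MASK]."):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```

Because the 32,000-token vocabulary was built from legal text with SentencePiece, common legal terms are more likely to survive as single tokens than under BERT's general-domain WordPiece vocabulary.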

Core Capabilities

  • Legal text classification tasks
  • Multiple choice legal reasoning on the CaseHOLD dataset (see the scoring sketch after this list)
  • Legal document analysis
  • Case law understanding and processing
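
For CaseHOLD-style reasoning, each citing context is paired with five candidate holdings and the model scores every pairing. The sketch below uses transformers' multiple-choice head; note that AutoModelForMultipleChoice attaches a freshly initialized scoring head to the pretrained encoder, so it must be fine-tuned on CaseHOLD before its predictions are meaningful. The context and holdings are invented placeholders:

```python
# A hedged sketch of CaseHOLD-style scoring: the citing context is
# paired with each of five candidate holdings, and the model scores
# every (context, holding) pair. AutoModelForMultipleChoice attaches
# a freshly initialized head, so fine-tune on CaseHOLD before use.
import torch
from transformers import AutoModelForMultipleChoice, AutoTokenizer

model_id = "casehold/custom-legalbert"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMultipleChoice.from_pretrained(model_id)

# Invented placeholder example; real CaseHOLD prompts cite a case and
# mask out the parenthetical holding.
context = "The district court applied Rule 56, <HOLDING>."
holdings = [
    "holding that summary judgment was appropriate",
    "holding that the contract was unconscionable",
    "holding that the claim was time-barred",
    "holding that the evidence was inadmissible hearsay",
    "holding that venue was improper",
]

# Encode the five pairs, then reshape to (batch=1, num_choices, seq_len)
# as the multiple-choice head expects.
enc = tokenizer([context] * len(holdings), holdings,
                padding=True, truncation=True, return_tensors="pt")
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_choices)
print("Best holding:", holdings[logits.argmax(dim=-1).item()])
```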

Frequently Asked Questions

Q: What makes this model unique?

This model's uniqueness lies in its specialized legal-domain training: its 37GB corpus is one of the largest legal text corpora ever used for model pretraining. The custom vocabulary and tokenization are optimized specifically for legal text, making it particularly effective for legal NLP tasks.

Q: What are the recommended use cases?

The model is particularly suited for tasks like case law analysis, legal document classification, and multiple choice reasoning about legal holdings (CaseHOLD). It's designed for applications requiring deep understanding of legal terminology and concepts.
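
For legal document classification, the encoder can be paired with a standard sequence-classification head. A minimal sketch follows, again assuming the casehold/custom-legalbert Hub id; the label set is hypothetical, and the head is randomly initialized until fine-tuned on labeled decisions:

```python
# A minimal fine-tuning-oriented sketch for legal document
# classification. The label set is hypothetical; the classification
# head is randomly initialized until trained on labeled decisions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "casehold/custom-legalbert"  # assumed Hub id
labels = ["contract", "tort", "criminal"]  # illustrative labels only
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=len(labels)
)

batch = tokenizer(
    ["Plaintiff alleges breach of the purchase agreement."],
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
with torch.no_grad():  # inference only; training would use Trainer or a loop
    probs = model(**batch).logits.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```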
