custom-legalbert

casehold

A custom Legal-BERT model pretrained from scratch on 37GB of Harvard Law cases (3.4M decisions) with MLM and NSP objectives, legal-specific tokenization, and a 32,000-token vocabulary.

Property          Value
Author            casehold
Training Data     Harvard Law case corpus (37GB)
Vocabulary Size   32,000 tokens
Paper             arXiv:2104.08671

What is custom-legalbert?

Custom Legal-BERT is a language model pretrained for legal-domain tasks. It uses the BERT architecture and was trained from scratch on 3.4 million legal decisions from the Harvard Law case corpus, covering 1965 to the present. At 37GB, the training corpus is roughly 2.5 times the size of the BookCorpus/Wikipedia dataset (15GB) used to train the original BERT, which gives the model broad coverage of legal terminology and concepts.

Implementation Details

The model implements several domain-specific optimizations:

  • Custom tokenization adapted specifically for legal text
  • Domain-specific legal vocabulary of 32,000 tokens created using SentencePiece
  • Trained for 2 million steps using Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives
  • Specialized sentence segmentation designed for legal documents
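
The two pretraining objectives named above can be illustrated with a minimal sketch. It assumes the masking recipe of the original BERT (15% of positions selected; of those, 80% replaced with [MASK], 10% replaced with a random token, 10% left unchanged), which this page does not confirm for custom-legalbert. The function names, the toy vocabulary, and the same-document sampling of negative NSP pairs are hypothetical simplifications, not the authors' pipeline.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels) for Masked Language Modeling.

    labels[i] holds the original token at positions the model must
    predict and None everywhere else.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return masked, labels


def make_nsp_pair(sentences, index, rng=None):
    """Return (sentence_a, sentence_b, is_next) for Next Sentence Prediction.

    Simplification: negatives are drawn from the same sentence list, so the
    list needs at least three sentences. BERT samples them from another document.
    """
    rng = rng or random.Random()
    sentence_a = sentences[index]
    if index + 1 < len(sentences) and rng.random() < 0.5:
        return sentence_a, sentences[index + 1], True
    # Exclude sentence_a itself and its true successor from the negatives.
    candidates = [s for j, s in enumerate(sentences) if j not in (index, index + 1)]
    return sentence_a, rng.choice(candidates), False
```

Passing a seeded `random.Random` as `rng` makes the masking reproducible, which helps when inspecting individual training examples.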

Core Capabilities

  • Legal text classification tasks
  • Multiple choice legal reasoning (CaseHOLD dataset)
  • Legal document analysis
  • Case law understanding and processing
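
For the multiple-choice capability, a rough sketch of how candidate holdings might be scored with the Hugging Face transformers library follows. The Hub identifier `casehold/custom-legalbert` is an assumption built from the author and model name listed on this page, and `build_choice_pairs` and `score_holdings` are hypothetical helper names. The checkpoint is described here only as pretrained, so the multiple-choice head would be freshly initialized and the scores become meaningful only after fine-tuning on a dataset such as CaseHOLD.

```python
MODEL_ID = "casehold/custom-legalbert"  # assumed Hub id, not stated on this page


def build_choice_pairs(context, holdings):
    """Pair the citing context with every candidate holding."""
    return [context] * len(holdings), list(holdings)


def score_holdings(context, holdings, model_id=MODEL_ID):
    """Return one logit per candidate holding (higher means preferred)."""
    # Imported lazily so build_choice_pairs can be used without these packages.
    import torch
    from transformers import AutoModelForMultipleChoice, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # The classification head is newly initialized unless the checkpoint
    # was already fine-tuned for multiple choice.
    model = AutoModelForMultipleChoice.from_pretrained(model_id)
    model.eval()

    contexts, candidates = build_choice_pairs(context, holdings)
    encoded = tokenizer(
        contexts,
        candidates,
        truncation=True,
        max_length=512,  # BERT's position-embedding limit
        padding=True,
        return_tensors="pt",
    )
    # Add a batch dimension: (num_choices, seq_len) -> (1, num_choices, seq_len)
    batch = {name: tensor.unsqueeze(0) for name, tensor in encoded.items()}
    with torch.no_grad():
        logits = model(**batch).logits  # shape (1, num_choices)
    return logits.squeeze(0).tolist()
```

Encoding each (context, holding) pair separately and letting the model compare the resulting scores is the usual way BERT-style models handle multiple-choice inputs.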

Frequently Asked Questions

Q: What makes this model unique?

What sets this model apart is that it was pretrained from scratch on legal text: a 37GB corpus of 3.4 million decisions, paired with a 32,000-token vocabulary and tokenization built for legal language. These choices are intended to make it effective on legal NLP tasks.

Q: What are the recommended use cases?

The model is particularly suited for tasks like case law analysis, legal document classification, and multiple choice reasoning about legal holdings (CaseHOLD). It's designed for applications requiring deep understanding of legal terminology and concepts.
