legalbert-large-1.7M-1

pile-of-law

BERT Large model pretrained for 1.7M steps on the 256GB Pile of Law corpus, with a custom 32,000-token legal vocabulary, intended for legal text analysis

Property	Value
Model Type	BERT Large (Uncased)
Training Steps	1.7 Million
Vocabulary Size	32,000 tokens
Training Data	256GB Pile of Law Dataset
License	CC Attribution-NonCommercial-ShareAlike 4.0
Paper	arXiv:2207.00220

What is legalbert-large-1.7M-1?

LegalBERT Large 1.7M-1 is a BERT model pretrained for legal and administrative text processing. It uses the BERT large architecture and was pretrained on the Pile of Law dataset, approximately 256GB of legal text drawn from 35 sources, including court opinions, legal analyses, and government publications.

Implementation Details

The model uses a custom vocabulary of 32,000 tokens: 29,000 word-piece tokens plus 3,000 legal terms from Black's Law Dictionary. It was pretrained with a masked language modeling (MLM) objective and no next-sentence prediction (NSP) loss, following RoBERTa, on a SambaNova cluster with 8 RDUs. Training used a conservative learning rate of 5e-6 and a batch size of 128 to keep optimization stable across the diverse legal sources.
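The effect of adding legal terms to a word-piece vocabulary can be illustrated with a minimal sketch of greedy longest-match-first WordPiece splitting. The toy vocabularies and the `wordpiece` helper below are assumptions made for illustration; they are not the model's actual vocabulary or tokenizer code.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy stand-ins (hypothetical): a general word-piece vocabulary and a
# legal-term list, merged the way the 29,000 + 3,000 split suggests.
base_vocab = {"est", "##op", "##pel", "the", "doctrine", "of"}
legal_terms = {"estoppel"}
legal_vocab = base_vocab | legal_terms

print(wordpiece("estoppel", base_vocab))   # ['est', '##op', '##pel']
print(wordpiece("estoppel", legal_vocab))  # ['estoppel']
```

With the legal term in the vocabulary, the word stays a single token instead of being split into three pieces, which leaves more of the 512-token window for actual content.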

Custom legal vocabulary integration
Specialized sentence segmentation for legal citations
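Maximum sequence length of 512 tokens
80-10-10 masking as in BERT (of the tokens selected for prediction, 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged), with a replication rate of 20 so that each context is masked 20 different ways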
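The masking scheme in the last item can be sketched in a few lines of plain Python. This is an illustrative approximation, not the training code: the 15% selection probability is BERT's usual default and is an assumption here (the page does not state it), and `mask_tokens` and `replicate` are hypothetical helper names.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted, labels) for one pass of 80-10-10 masking.

    mask_prob=0.15 is an assumed value (BERT's default).
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)  # None means "not a prediction target"
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels


def replicate(tokens, vocab, replication=20, seed=0):
    """Create `replication` independently masked copies of one context."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, vocab, rng) for _ in range(replication)]


tokens = "the court granted the motion to dismiss".split()
vocab = sorted(set(tokens))
copies = replicate(tokens, vocab)
print(len(copies))  # 20
```

Each of the 20 copies draws its own mask positions, so the model sees the same passage with different prediction targets across training.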

Core Capabilities

Legal text comprehension and analysis
Masked language modeling for legal documents
Support for downstream legal tasks
Reported 75.0 F1 on the CaseHOLD benchmark

Frequently Asked Questions

Q: What makes this model unique?

Two things set the model apart: it was pretrained exclusively on legal text, and its custom vocabulary includes dedicated legal terminology. It is also among the larger models trained only on legal data, which makes it well suited to legal-domain tasks.

Q: What are the recommended use cases?

The model is best suited for legal text analysis, document comprehension, and specialized legal NLP tasks. It can be used for masked language modeling out of the box or fine-tuned for specific downstream legal applications like case analysis, legal document processing, or legal research assistance.
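For the out-of-the-box masked language modeling use, a minimal sketch with the Hugging Face `transformers` fill-mask pipeline might look like the following. The Hub identifier is an assumption pieced together from the organization and model names shown on this page, and `summarize` and `predict_masked` are hypothetical helper names. Running `predict_masked` requires `transformers`, a backend such as PyTorch, and network access to download the weights.

```python
# Assumed Hub id, built from the organization and model names on this page.
MODEL_ID = "pile-of-law/legalbert-large-1.7M-1"


def summarize(predictions):
    """Reduce fill-mask output to (token, score) pairs, best first."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"].strip(), round(p["score"], 4)) for p in ranked]


def predict_masked(template, top_k=5):
    """Fill the `{mask}` slot in `template` with the model's top guesses."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=MODEL_ID)
    # Use the tokenizer's own mask token rather than hard-coding it.
    text = template.format(mask=fill.tokenizer.mask_token)
    return summarize(fill(text, top_k=top_k))
```

A call such as `predict_masked("The plaintiff filed a motion for summary {mask}.")` would return the highest-scoring candidate tokens with their probabilities. For downstream tasks, the same checkpoint would typically be loaded with a task-specific head and fine-tuned on labeled legal data.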