legalbert-large-1.7M-1

Maintained By: pile-of-law

LegalBERT Large 1.7M-1

Property          Value
Model Type        BERT Large (Uncased)
Training Steps    1.7 Million
Vocabulary Size   32,000 tokens
Training Data     256GB Pile of Law Dataset
License           CC Attribution-NonCommercial-ShareAlike 4.0
Paper             arXiv:2207.00220

What is legalbert-large-1.7M-1?

LegalBERT Large 1.7M-1 is a BERT model pretrained for legal and administrative text processing. It uses the BERT large architecture and was pretrained on the Pile of Law dataset, approximately 256GB of legal text drawn from 35 sources, including court opinions, legal analyses, and government publications.

Implementation Details

The model uses a custom vocabulary of 32,000 tokens: 29,000 word-piece tokens plus 3,000 legal terms from Black's Law Dictionary. It was trained with the masked language modeling (MLM) objective as described in RoBERTa, without the next sentence prediction (NSP) loss, on a SambaNova cluster with 8 RDUs. Training used a conservative learning rate of 5e-6 and a batch size of 128 to keep optimization stable across the diverse legal sources.
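To put the figures above in proportion, the short calculation below checks the vocabulary split and estimates an upper bound on the number of tokens processed during pretraining. It is only a back-of-the-envelope sketch: it assumes the batch size of 128 counts sequences, that every sequence is filled to the 512-token sequence length noted in this section, and that all 1.7 million steps used the same batch size. None of these assumptions is confirmed on this page.

```python
# Figures taken from this page.
WORDPIECE_TOKENS = 29_000
LEGAL_TERMS = 3_000
TRAINING_STEPS = 1_700_000
BATCH_SIZE = 128        # assumed to be counted in sequences
MAX_SEQ_LEN = 512       # assumed to be fully used by every sequence

# The two vocabulary parts should add up to the stated 32,000 tokens.
vocab_size = WORDPIECE_TOKENS + LEGAL_TERMS

# Upper bound on sequences and tokens seen over the whole run.
sequences_seen = TRAINING_STEPS * BATCH_SIZE
max_tokens_seen = sequences_seen * MAX_SEQ_LEN

print(f"vocabulary size: {vocab_size:,}")
print(f"sequences seen:  {sequences_seen:,}")
print(f"tokens seen (upper bound): {max_tokens_seen:,}")
```

Under these assumptions the run covers at most about 111 billion token positions; shorter or padded sequences would lower the real figure.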

  • Custom legal vocabulary integration
  • Specialized sentence segmentation for legal citations
  • 512 token sequence length
  • 80-10-10 masking strategy (mask, random replacement, unchanged) with a 20x replication rate
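The 80-10-10 masking with replication listed above can be illustrated with a small pure-Python sketch. This is not the authors' preprocessing code: the function names are hypothetical, the 15% selection probability is the usual BERT default and is an assumption here (this page does not state it), and whole words stand in for word-piece tokens.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted, labels); labels[i] is None where nothing is predicted.

    mask_prob=0.15 is an assumed default, not a value stated on this page.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                      # position not selected for prediction
        labels[i] = tok                   # the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN     # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

def replicate(tokens, vocab, copies=20, seed=0):
    """Create `copies` independently masked versions of the same context."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, vocab, rng) for _ in range(copies)]

if __name__ == "__main__":
    sentence = "the court held that the contract was void".split()
    toy_vocab = ["court", "statute", "plaintiff", "void"]
    for corrupted, _ in replicate(sentence, toy_vocab, copies=3):
        print(" ".join(corrupted))
```

Replication means each context is seen with several different masks rather than one fixed mask, which is the purpose of the 20x rate described above.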

Core Capabilities

  • Legal text comprehension and analysis
  • Masked language modeling for legal documents
  • Support for downstream legal tasks
  • 75.0 F1 on the CaseHOLD benchmark

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for its legal-domain training data and its custom vocabulary, which includes specific legal terminology. It is one of the largest models trained exclusively on legal text, which makes it particularly effective for legal domain tasks.

Q: What are the recommended use cases?

The model is best suited for legal text analysis, document comprehension, and specialized legal NLP tasks. It can be used for masked language modeling out of the box or fine-tuned for specific downstream legal applications like case analysis, legal document processing, or legal research assistance.
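As an illustration of the out-of-the-box masked language modeling use mentioned above, here is a minimal sketch built on the Hugging Face transformers fill-mask pipeline. The Hub identifier is an assumption formed from the maintainer and model names shown on this page, and fill_legal_mask is a hypothetical helper name. Calling the function downloads the model weights, so it needs network access and an installed transformers package.

```python
# Assumed Hub identifier: maintainer name + model name from this page.
MODEL_ID = "pile-of-law/legalbert-large-1.7M-1"

def fill_legal_mask(text, top_k=5):
    """Return the top_k (token, score) guesses for the single [MASK] in `text`.

    Hypothetical helper: builds a fill-mask pipeline on each call, which is
    simple but slow; reuse the pipeline object in real code.
    """
    # Imported lazily so that defining this helper does not require
    # transformers to be installed.
    from transformers import pipeline

    pipe = pipeline(task="fill-mask", model=MODEL_ID)
    predictions = pipe(text, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]
```

A call such as fill_legal_mask("The defendant filed a motion to [MASK] the complaint.") would return candidate tokens with their scores. For downstream tasks, the same identifier can be passed to the library's fine-tuning classes instead of the pipeline.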
