ModernBERT-base
| Property | Value |
|---|---|
| Parameter Count | 149 million |
| Context Length | 8,192 tokens |
| Training Data | 2 trillion tokens |
| License | Apache 2.0 |
| Paper | arXiv:2412.13663 |
What is ModernBERT-base?
ModernBERT-base is a state-of-the-art bidirectional encoder that modernizes the original BERT architecture with recent advances in transformer design. It was trained on an extensive dataset of 2 trillion tokens spanning both English text and code.
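As a quick orientation, here is a minimal masked-language-modeling sketch using the Hugging Face transformers pipeline; it assumes the model is available under the repo id answerdotai/ModernBERT-base and that a transformers release with ModernBERT support is installed.

```python
# A minimal fill-mask sketch, assuming the Hugging Face repo id
# "answerdotai/ModernBERT-base" and a transformers version with ModernBERT support.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-base")

# ModernBERT uses the standard [MASK] token for masked-language modeling.
predictions = fill_mask("The capital of France is [MASK].")
for p in predictions[:3]:
    print(f"{p['token_str']!r}  score={p['score']:.3f}")
```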
Implementation Details
The model implements several modern architectural improvements, including Rotary Positional Embeddings (RoPE), alternating local-global attention, and Flash Attention support. It is built on a pre-norm transformer architecture with GeGLU activations and was trained with the StableAdamW optimizer and a trapezoidal learning rate schedule. A loading sketch that exercises the long-context and Flash Attention features follows the list below.
- 22 transformer layers
- Native support for sequences up to 8,192 tokens
- Efficient unpadding and Flash Attention optimization
- Pre-trained on both text and code data
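The sketch below shows one way to exercise these features: it loads the encoder with the flash_attention_2 implementation and runs a long input. It assumes the repo id answerdotai/ModernBERT-base, a CUDA-capable GPU, and an installed flash-attn package; omit the attn_implementation argument to fall back to the default attention.

```python
# A loading sketch for long-context inference with Flash Attention 2.
# Assumes the repo id "answerdotai/ModernBERT-base", a CUDA-capable GPU,
# and the flash-attn package installed.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")

long_text = "Replace with a long document (up to 8,192 tokens of text or code)."
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=8192).to("cuda")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits.shape)  # (batch, sequence_length, vocab_size)
```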
Core Capabilities
- Superior performance on GLUE benchmark tasks
- Excellent retrieval capabilities on BEIR and MLDR datasets
- State-of-the-art results in code retrieval tasks
- Efficient processing of long-context inputs
- Strong performance in both single-vector and multi-vector retrieval settings (a single-vector embedding sketch follows this list)
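For the single-vector setting, one common recipe (a sketch, not the official retrieval pipeline) is to mean-pool the encoder's token states into a single embedding per text and rank documents by cosine similarity. The example below assumes the repo id answerdotai/ModernBERT-base; in practice the checkpoint would typically be fine-tuned for retrieval before use.

```python
# A single-vector retrieval sketch: mean-pool ModernBERT token states into one
# embedding per text and rank by cosine similarity. Assumes the repo id
# "answerdotai/ModernBERT-base"; for strong retrieval quality the base model
# is normally fine-tuned first.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over real tokens
    return F.normalize(pooled, dim=-1)

query = embed(["How do I sort a list in Python?"])
docs = embed(["Use the built-in sorted() function.", "Paris is the capital of France."])
print(query @ docs.T)  # cosine similarities; higher = more relevant
```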
Frequently Asked Questions
Q: What makes this model unique?
ModernBERT-base stands out through its combination of modern architectural improvements, large-scale training on diverse text and code, and efficient handling of long sequences. It delivers strong downstream accuracy while maintaining practical inference speeds, particularly when run with Flash Attention 2.
Q: What are the recommended use cases?
The model excels in tasks requiring long document processing, including document retrieval, classification, and semantic search. It's particularly effective for code-related tasks and hybrid (text + code) semantic search applications. The model can be fine-tuned for specific downstream tasks following standard BERT fine-tuning approaches.
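As an illustration of that standard fine-tuning path, the sketch below attaches a sequence-classification head and trains it with the Hugging Face Trainer. The imdb dataset and the hyperparameters are illustrative placeholders, not recommended settings, and the repo id answerdotai/ModernBERT-base is assumed.

```python
# A fine-tuning sketch following the standard BERT recipe: add a classification
# head and train with the Hugging Face Trainer. Dataset and hyperparameters are
# placeholders for illustration only.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

dataset = load_dataset("imdb")  # placeholder binary-classification dataset

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="modernbert-finetuned",
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    processing_class=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```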