ModernBERT-base

Maintained by: answerdotai


| Property        | Value              |
|-----------------|--------------------|
| Parameter Count | 149 million        |
| Context Length  | 8,192 tokens       |
| Training Data   | 2 trillion tokens  |
| License         | Apache 2.0         |
| Paper           | arXiv:2412.13663   |

What is ModernBERT-base?

ModernBERT-base is a state-of-the-art bidirectional encoder that modernizes the original BERT architecture with a series of recent architectural improvements. Trained on 2 trillion tokens spanning English text and code, it represents a significant advance over earlier BERT-style encoders.

Implementation Details

The model incorporates several modern architectural improvements, including Rotary Positional Embeddings (RoPE), alternating local-global attention, and Flash Attention support. It uses a pre-norm transformer architecture with GeGLU activations and was trained with the StableAdamW optimizer under a trapezoidal learning-rate schedule.

  • 22 transformer layers
  • Native support for sequences up to 8,192 tokens
  • Efficient unpadding and Flash Attention optimization
  • Pre-trained on both text and code data
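To make the RoPE component above concrete, here is a minimal, illustrative NumPy sketch (not the model's actual implementation): rotary embeddings rotate paired dimensions of the query and key vectors by position-dependent angles, so that the attention dot product depends only on the *relative* distance between tokens.

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Apply a rotary positional embedding to a single vector.

    x is split into two halves; each (x1[i], x2[i]) pair is rotated by
    angle pos * theta_i, with theta_i = base ** (-i / (d/2)).
    """
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # theta_i per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # 2-D rotation applied independently to every (x1[i], x2[i]) pair
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# Rotations preserve vector norms, and q·k after RoPE depends only on
# the offset between the two positions, which is what lets the model
# generalize across absolute positions.
```

Because each pair undergoes a pure rotation, shifting both query and key by the same number of positions leaves their dot product unchanged.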

Core Capabilities

  • Superior performance on GLUE benchmark tasks
  • Excellent retrieval capabilities on BEIR and MLDR datasets
  • State-of-the-art results in code retrieval tasks
  • Efficient processing of long-context inputs
  • Strong performance in both single-vector and multi-vector retrieval settings
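As a rough illustration of the single-vector retrieval setting mentioned above, the sketch below shows the common pattern of mean-pooling token embeddings (ignoring padding) into one document vector and comparing vectors by cosine similarity. The function names and dummy data are hypothetical; real pipelines would pool the encoder's hidden states.

```python
import numpy as np

def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over real (non-padding) tokens.

    hidden: (seq_len, dim) token embeddings
    mask:   (seq_len,) attention mask, 1 for real tokens, 0 for padding
    """
    m = mask[:, None].astype(float)
    return (hidden * m).sum(axis=0) / m.sum()

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two pooled document vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Padding rows (mask == 0) do not influence the pooled vector.
hidden = np.array([[1.0, 0.0],
                   [3.0, 0.0],
                   [100.0, 100.0]])  # last row is padding
mask = np.array([1, 1, 0])
doc_vec = mean_pool(hidden, mask)
```

Multi-vector settings (e.g. ColBERT-style late interaction) instead keep one vector per token and score with per-token maxima, trading memory for finer-grained matching.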

Frequently Asked Questions

Q: What makes this model unique?

ModernBERT-base stands out through its combination of modern architectural improvements, extensive training on diverse data, and efficient handling of long sequences. It achieves superior performance while maintaining practical inference speeds, particularly when used with Flash Attention 2.

Q: What are the recommended use cases?

The model excels in tasks requiring long document processing, including document retrieval, classification, and semantic search. It's particularly effective for code-related tasks and hybrid (text + code) semantic search applications. The model can be fine-tuned for specific downstream tasks following standard BERT fine-tuning approaches.
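A minimal usage sketch along these lines, assuming the `transformers` library (a recent version with ModernBERT support) is installed; the `fill_mask` helper and example text are illustrative, and the model weights are downloaded on first use:

```python
# Hypothetical helper for trying ModernBERT-base as a masked language model.
MODEL_ID = "answerdotai/ModernBERT-base"

def fill_mask(text: str):
    """Return fill-mask predictions for `text`, which must contain [MASK].

    Imported lazily so the constant above can be inspected without
    pulling in transformers; requires a version that ships ModernBERT.
    """
    from transformers import pipeline
    pipe = pipeline("fill-mask", model=MODEL_ID)
    return pipe(text)

# Example call (downloads weights on first run):
# fill_mask("Paris is the [MASK] of France.")
```

For downstream tasks, the same checkpoint can be loaded with a task head (e.g. a sequence-classification head) and fine-tuned with the standard BERT recipe.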
