NeoBERT

chandar-lab

NeoBERT is a 250M-parameter next-generation BERT-style encoder trained on RefinedWeb, featuring a 4,096-token context length and state-of-the-art performance on the MTEB benchmark.

Property           Value
Parameter Count    250M
Context Length     4,096 tokens
Architecture       28 layers × 768 width
Training Data      RefinedWeb (2.8 TB)
License            MIT

What is NeoBERT?

NeoBERT is a next-generation encoder for English text representation. Pre-trained from scratch on the large RefinedWeb dataset, it combines modern architectural improvements with an optimized training methodology while keeping a relatively compact 250M-parameter footprint.

Implementation Details

The model incorporates several cutting-edge technical features that contribute to its exceptional performance:

  • SwiGLU activation in the feed-forward blocks (see the sketch after this list)
  • RoPE (rotary positional embeddings) for relative position encoding
  • Pre-RMSNorm for stable training
  • FlashAttention for computational efficiency
  • 20% MLM masking rate during pre-training
  • Trained on 2.1T tokens with the AdamW optimizer and cosine learning-rate decay
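
As an illustration of the SwiGLU item above, here is a minimal PyTorch sketch of a SwiGLU feed-forward block under the common definition SwiGLU(x) = (SiLU(x·W_gate) ⊙ x·W_up)·W_down. The class name, layer names, and hidden size are hypothetical, not NeoBERT's actual modules.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SwiGLUFeedForward(nn.Module):
        """SwiGLU block: down-project the SiLU-gated up-projection."""

        def __init__(self, dim: int = 768, hidden_dim: int = 2048):
            super().__init__()
            # Hypothetical widths; NeoBERT's real hidden size may differ.
            self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
            self.w_up = nn.Linear(dim, hidden_dim, bias=False)
            self.w_down = nn.Linear(hidden_dim, dim, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # The SiLU-activated gate modulates the linear up-projection.
            return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

    # Example: 2 sequences of 16 tokens at the model width of 768.
    x = torch.randn(2, 16, 768)
    print(SwiGLUFeedForward()(x).shape)  # torch.Size([2, 16, 768])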

Core Capabilities

  • State-of-the-art performance on the MTEB benchmark
  • Extended context length of 4,096 tokens
  • Plug-and-play replacement for existing base models (see the loading sketch below)
  • Efficient processing with an optimized depth-to-width ratio
  • Outperforms larger models such as BERT-large and RoBERTa-large
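
To make the plug-and-play claim concrete, the sketch below loads the model with Hugging Face Transformers. It assumes the checkpoint is published as chandar-lab/NeoBERT with custom modeling code (hence trust_remote_code=True) and that the model returns the standard last_hidden_state; verify both against the official model card.

    from transformers import AutoModel, AutoTokenizer

    # Assumed repository id; confirm on the official model card.
    model_id = "chandar-lab/NeoBERT"

    # trust_remote_code=True is assumed because the architecture ships custom code.
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

    inputs = tokenizer("NeoBERT handles sequences up to 4,096 tokens.",
                       return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # one 768-wide vector per token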

Frequently Asked Questions

Q: What makes this model unique?

NeoBERT stands out through its optimal balance of efficiency and performance, achieving state-of-the-art results despite its modest 250M parameter count. It incorporates modern architectural improvements while maintaining compatibility with existing BERT-based workflows.

Q: What are the recommended use cases?

The model is ideal for general-purpose text representation tasks, particularly when efficiency is crucial. It's especially suitable for applications requiring longer context understanding (up to 4,096 tokens) and can serve as a drop-in replacement for existing BERT-based models.
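
For representation tasks like these, one common recipe (assumed here, not prescribed by NeoBERT) is to mean-pool the token states into a single vector per text, truncating at the 4,096-token limit:

    import torch

    def embed(texts, tokenizer, model):
        # Tokenize a batch, truncating at the model's 4,096-token context.
        batch = tokenizer(texts, padding=True, truncation=True,
                          max_length=4096, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**batch).last_hidden_state  # (batch, seq, 768)
        # Average over real tokens only, masking out padding positions.
        mask = batch["attention_mask"].unsqueeze(-1).float()
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

    # Usage, reusing the tokenizer and model loaded earlier:
    # vectors = embed(["a query", "a long document"], tokenizer, model)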
