# NeoBERT
| Property | Value |
|---|---|
| Parameter Count | 250M |
| Context Length | 4,096 tokens |
| Architecture | 28 layers × 768 width |
| Training Data | RefinedWeb (2.8 TB) |
| License | MIT |
## What is NeoBERT?
NeoBERT is a next-generation transformer encoder for English text representation. Pre-trained from scratch on the 2.8 TB RefinedWeb dataset, it combines modern architectural improvements with an optimized training recipe while keeping a compact 250M-parameter footprint.
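A minimal usage sketch with Hugging Face `transformers`, assuming the checkpoint is published on the Hub under an identifier like `chandar-lab/NeoBERT` with custom modelling code (hence `trust_remote_code=True`); the identifier and output attribute names are assumptions, so check the released model card:

```python
from transformers import AutoModel, AutoTokenizer

model_id = "chandar-lab/NeoBERT"  # assumed Hub identifier

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("NeoBERT is a next-generation encoder.", return_tensors="pt")
outputs = model(**inputs)

# One 768-dimensional vector per token; the first position ([CLS]) is a
# common choice for a sentence-level embedding.
print(outputs.last_hidden_state.shape)
```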
## Implementation Details
The model incorporates several modern architectural and training choices that account for its efficiency and accuracy (a minimal sketch of the pre-norm SwiGLU wiring follows this list):
- SwiGLU activation function in the feed-forward layers
- Rotary Positional Embeddings (RoPE) for encoding token positions
- Pre-RMSNorm (normalization applied before each sub-layer) for training stability
- FlashAttention for computational efficiency
- 20% MLM masking rate during pre-training (compared to the original BERT's 15%)
- Pre-trained on 2.1T tokens with the AdamW optimizer and a cosine learning-rate decay schedule
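The sketch below illustrates how a pre-RMSNorm residual sub-layer with a SwiGLU feed-forward network is typically wired; it is illustrative only, not the released implementation, and the hidden size of 2048 is an assumed value. The attention sub-layer (with RoPE and FlashAttention) follows the same pre-norm residual pattern.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square layer norm: scale only, no mean subtraction or bias."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * inv_rms)


class SwiGLUFeedForward(nn.Module):
    """SwiGLU FFN: down_proj( silu(x @ W_gate) * (x @ W_up) )."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class PreNormFFNSubLayer(nn.Module):
    """Pre-RMSNorm residual sub-layer: x + FFN(RMSNorm(x))."""

    def __init__(self, dim: int = 768, hidden: int = 2048):
        super().__init__()
        self.norm = RMSNorm(dim)
        self.ffn = SwiGLUFeedForward(dim, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ffn(self.norm(x))


x = torch.randn(2, 16, 768)           # (batch, sequence, width)
print(PreNormFFNSubLayer()(x).shape)  # torch.Size([2, 16, 768])
```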
## Core Capabilities
- State-of-the-art performance on the MTEB benchmark
- Extended context length of 4,096 tokens
- Plug-and-play replacement for existing base models (see the embedding sketch after this list)
- Efficient processing with optimized depth-to-width ratio
- Superior performance compared to larger models such as BERT-large and RoBERTa-large
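One way to read the "plug-and-play" claim is that an existing BERT-style embedding routine only needs its checkpoint name changed. The sketch below assumes the `chandar-lab/NeoBERT` identifier and `trust_remote_code=True`, and uses [CLS] pooling as one common choice (not necessarily the pooling used for the reported MTEB results):

```python
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "chandar-lab/NeoBERT"  # previously e.g. "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True)
model.eval()


def embed(texts):
    """Return one [CLS] embedding per input text, shape (batch, 768)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    return hidden[:, 0]


a, b = embed(["A cat sleeps on the mat.", "A dog naps on the rug."])
print(torch.cosine_similarity(a, b, dim=0).item())
```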
## Frequently Asked Questions
**Q: What makes this model unique?**
NeoBERT stands out for its balance of efficiency and performance: it achieves state-of-the-art results with a modest 250M parameter count, and its modern architectural improvements come without sacrificing compatibility with existing BERT-based workflows.
**Q: What are the recommended use cases?**
The model is ideal for general-purpose text representation tasks, particularly when efficiency is crucial. It's especially suitable for applications requiring longer context understanding (up to 4,096 tokens) and can serve as a drop-in replacement for existing BERT-based models.
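For the long-context use case, a single document of up to 4,096 tokens can be encoded in one forward pass. As in the earlier sketches, the checkpoint identifier and `trust_remote_code=True` are assumptions:

```python
from transformers import AutoModel, AutoTokenizer

checkpoint = "chandar-lab/NeoBERT"  # assumed Hub identifier

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True)

# A long document; in practice this could be a full report or article.
long_document = " ".join(["NeoBERT encodes long inputs in one pass."] * 400)

inputs = tokenizer(
    long_document,
    truncation=True,
    max_length=4096,  # the model's full context window
    return_tensors="pt",
)
outputs = model(**inputs)
print(inputs["input_ids"].shape, outputs.last_hidden_state.shape)
```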