# NeoBERT
| Property | Value |
|---|---|
| Parameter Count | 250M |
| Context Length | 4,096 tokens |
| Architecture | 28 layers × 768 width |
| Training Data | RefinedWeb (2.8 TB) |
| License | MIT |
## What is NeoBERT?
NeoBERT is a next-generation transformer encoder for English text representation. Pre-trained from scratch on the 2.8 TB RefinedWeb dataset, it combines modern architectural improvements with an optimized training recipe while keeping a compact 250M-parameter footprint.
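A minimal usage sketch with Hugging Face `transformers`, assuming the checkpoint is published on the Hub under an identifier like `chandar-lab/NeoBERT` with custom modelling code (hence `trust_remote_code=True`); the identifier and output attribute names are assumptions, so check the released model card:

```python
from transformers import AutoModel, AutoTokenizer

model_id = "chandar-lab/NeoBERT"  # assumed Hub identifier

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("NeoBERT is a next-generation encoder.", return_tensors="pt")
outputs = model(**inputs)

# One 768-dimensional vector per token; the first position ([CLS]) is a
# common choice for a sentence-level embedding.
print(outputs.last_hidden_state.shape)
```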
## Implementation Details
The model incorporates several modern architectural and training choices that account for its efficiency and accuracy (a minimal sketch of the pre-norm SwiGLU wiring follows this list):
- SwiGLU activation function in the feed-forward layers
- Rotary Positional Embeddings (RoPE) for encoding token positions
- Pre-RMSNorm (normalization applied before each sub-layer) for training stability
- FlashAttention for computational efficiency
- 20% MLM masking rate during pre-training (compared to the original BERT's 15%)
- Pre-trained on 2.1T tokens with the AdamW optimizer and a cosine learning-rate decay schedule
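The sketch below illustrates how a pre-RMSNorm residual sub-layer with a SwiGLU feed-forward network is typically wired; it is illustrative only, not the released implementation, and the hidden size of 2048 is an assumed value. The attention sub-layer (with RoPE and FlashAttention) follows the same pre-norm residual pattern.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square layer norm: scale only, no mean subtraction or bias."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * inv_rms)


class SwiGLUFeedForward(nn.Module):
    """SwiGLU FFN: down_proj( silu(x @ W_gate) * (x @ W_up) )."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class PreNormFFNSubLayer(nn.Module):
    """Pre-RMSNorm residual sub-layer: x + FFN(RMSNorm(x))."""

    def __init__(self, dim: int = 768, hidden: int = 2048):
        super().__init__()
        self.norm = RMSNorm(dim)
        self.ffn = SwiGLUFeedForward(dim, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ffn(self.norm(x))


x = torch.randn(2, 16, 768)           # (batch, sequence, width)
print(PreNormFFNSubLayer()(x).shape)  # torch.Size([2, 16, 768])
```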
## Core Capabilities
- State-of-the-art performance on the MTEB benchmark
- Extended context length of 4,096 tokens
- Plug-and-play replacement for existing base models (see the embedding sketch after this list)
- Efficient processing with optimized depth-to-width ratio
- Superior performance compared to larger models such as BERT-large and RoBERTa-large
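One way to read the "plug-and-play" claim is that an existing BERT-style embedding routine only needs its checkpoint name changed. The sketch below assumes the `chandar-lab/NeoBERT` identifier and `trust_remote_code=True`, and uses [CLS] pooling as one common choice (not necessarily the pooling used for the reported MTEB results):

```python
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "chandar-lab/NeoBERT"  # previously e.g. "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True)
model.eval()


def embed(texts):
    """Return one [CLS] embedding per input text, shape (batch, 768)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    return hidden[:, 0]


a, b = embed(["A cat sleeps on the mat.", "A dog naps on the rug."])
print(torch.cosine_similarity(a, b, dim=0).item())
```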
## Frequently Asked Questions
**Q: What makes this model unique?**
NeoBERT stands out for its balance of efficiency and performance: it achieves state-of-the-art results with a modest 250M parameter count, and its modern architectural improvements come without sacrificing compatibility with existing BERT-based workflows.
**Q: What are the recommended use cases?**
The model is ideal for general-purpose text representation tasks, particularly when efficiency is crucial. It's especially suitable for applications requiring longer context understanding (up to 4,096 tokens) and can serve as a drop-in replacement for existing BERT-based models.
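For the long-context use case, a single document of up to 4,096 tokens can be encoded in one forward pass. As in the earlier sketches, the checkpoint identifier and `trust_remote_code=True` are assumptions:

```python
from transformers import AutoModel, AutoTokenizer

checkpoint = "chandar-lab/NeoBERT"  # assumed Hub identifier

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True)

# A long document; in practice this could be a full report or article.
long_document = " ".join(["NeoBERT encodes long inputs in one pass."] * 400)

inputs = tokenizer(
    long_document,
    truncation=True,
    max_length=4096,  # the model's full context window
    return_tensors="pt",
)
outputs = model(**inputs)
print(inputs["input_ids"].shape, outputs.last_hidden_state.shape)
```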