Mamba-2.8B-Zephyr
| Property | Value |
|---|---|
| Base Model | state-spaces/mamba-2.8b-slimpj |
| Training Method | Direct Preference Optimization (DPO) |
| Dataset | UltraFeedback Binarized |
| Model Size | 2.8B parameters |
| Accuracy | 78.57% |
| Author | xiuyul |
What is mamba-2.8b-zephyr?
Mamba-2.8b-zephyr is a chat-aligned language model built on the Mamba state-space architecture and fine-tuned with Direct Preference Optimization (DPO) on the UltraFeedback binarized dataset. Applying a Zephyr-style alignment pipeline to a state-space backbone rather than a Transformer, it reaches 78.57% accuracy in distinguishing chosen from rejected responses.
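Below is a minimal loading-and-generation sketch. It assumes the checkpoint follows the layout of the base state-spaces Mamba releases and loads with the mamba_ssm package's MambaLMHeadModel, that the repository ships a tokenizer (the GPT-NeoX tokenizer is the usual fallback for Mamba models), and that Zephyr-style chat markers match the fine-tuning format; none of this is confirmed by the model card.

```python
# Hedged sketch: load the checkpoint and run greedy generation.
import torch
from transformers import AutoTokenizer
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

repo = "xiuyul/mamba-2.8b-zephyr"

# Assumption: the repo hosts a tokenizer; otherwise "EleutherAI/gpt-neox-20b"
# is the tokenizer typically paired with Mamba checkpoints.
tokenizer = AutoTokenizer.from_pretrained(repo)
model = MambaLMHeadModel.from_pretrained(repo, device="cuda", dtype=torch.bfloat16)

# Assumption: Zephyr-style chat markers were used during fine-tuning.
prompt = "<|user|>\nExplain state-space models in two sentences.</s>\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# Greedy decoding; sampling options differ across mamba_ssm versions.
out = model.generate(input_ids=input_ids, max_length=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```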
Implementation Details
The model was developed in two stages: the base model (mamba-2.8b-slimpj) was first instruction-tuned on the UltraChat 200k dataset, then preference-optimized with DPO on the UltraFeedback binarized dataset. Training ran across 8 GPUs with carefully tuned hyperparameters, including a learning rate of 5e-07 and a linear schedule with a 0.1 warmup ratio; the remaining settings are listed below (see the configuration sketch after the list).
- Trained for 3 epochs with a global batch size of 64
- Used the Adam optimizer with betas=(0.9, 0.999)
- Reached a final validation loss of 0.4996
- Achieved a reward margin of 1.1582 between chosen and rejected responses
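As a rough illustration, the hyperparameters above can be expressed as a trl DPOConfig. The model card does not say which training framework was used or how the Mamba backbone was wrapped for it, so treat this as a mapping of the reported settings rather than a reproduction of the original training script; the per-device batch split and precision are assumptions.

```python
# Hedged sketch: the reported DPO hyperparameters as a trl DPOConfig
# (recent trl versions; not the original training configuration).
from trl import DPOConfig

config = DPOConfig(
    output_dir="mamba-2.8b-zephyr-dpo",  # illustrative path
    learning_rate=5e-7,                  # reported learning rate
    lr_scheduler_type="linear",          # linear schedule
    warmup_ratio=0.1,                    # 0.1 warmup ratio
    num_train_epochs=3,                  # 3 epochs
    per_device_train_batch_size=8,       # assumption: 8 per GPU x 8 GPUs = global batch of 64
    adam_beta1=0.9,                      # Adam betas (0.9, 0.999)
    adam_beta2=0.999,
    bf16=True,                           # assumption: bf16 mixed precision
    # beta (the DPO temperature) is not reported; trl defaults to 0.1.
)
```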
Core Capabilities
- High accuracy (78.57%) in preference alignment (see the metrics sketch after this list)
- Clear separation of chosen from rejected responses (reward margin of 1.1582)
- Robust performance across varied input contexts
- Optimized for instruction following and preference learning
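The accuracy and reward-margin figures come from the implicit rewards defined by the DPO objective (Rafailov et al., 2023). The sketch below shows the standard formulation and how those metrics are typically computed; the variable names and the beta value are illustrative, not taken from the actual training code.

```python
# Standard DPO loss plus the accuracy / reward-margin metrics quoted above.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Inputs are summed log-probabilities of each response under the
    policy being trained and under the frozen reference (SFT) model."""
    # Implicit rewards: beta-scaled log-ratio between policy and reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # DPO objective: push the chosen reward above the rejected reward.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Accuracy: fraction of pairs where the chosen response wins.
    accuracy = (chosen_rewards > rejected_rewards).float().mean()
    # Reward margin: average gap between chosen and rejected rewards.
    margin = (chosen_rewards - rejected_rewards).mean()
    return loss, accuracy, margin
```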
Frequently Asked Questions
Q: What makes this model unique?
The model stands out for combining the Mamba architecture with DPO training, yielding strong preference alignment while retaining the efficient processing characteristics of state-space models.
Q: What are the recommended use cases?
While specific use cases aren't detailed in the model card, the model's strong preference alignment makes it suitable for tasks requiring nuanced understanding of user preferences and high-quality instruction following.