Mamba-2.8B-Zephyr
| Property | Value |
|---|---|
| Base Model | state-spaces/mamba-2.8b-slimpj |
| Training Method | Direct Preference Optimization (DPO) |
| Dataset | UltraFeedback Binarized |
| Model Size | 2.8B parameters |
| Accuracy | 78.57% |
| Author | xiuyul |
What is mamba-2.8b-zephyr?
Mamba-2.8b-zephyr is a chat-aligned language model built on the Mamba state-space architecture and fine-tuned with Direct Preference Optimization (DPO) on the UltraFeedback binarized dataset. Applying a Zephyr-style alignment pipeline to a state-space backbone rather than a Transformer, it reaches 78.57% accuracy in distinguishing chosen from rejected responses.
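Below is a minimal loading-and-generation sketch. It assumes the checkpoint follows the layout of the base state-spaces Mamba releases and loads with the mamba_ssm package's MambaLMHeadModel, that the repository ships a tokenizer (the GPT-NeoX tokenizer is the usual fallback for Mamba models), and that Zephyr-style chat markers match the fine-tuning format; none of this is confirmed by the model card.

```python
# Hedged sketch: load the checkpoint and run greedy generation.
import torch
from transformers import AutoTokenizer
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

repo = "xiuyul/mamba-2.8b-zephyr"

# Assumption: the repo hosts a tokenizer; otherwise "EleutherAI/gpt-neox-20b"
# is the tokenizer typically paired with Mamba checkpoints.
tokenizer = AutoTokenizer.from_pretrained(repo)
model = MambaLMHeadModel.from_pretrained(repo, device="cuda", dtype=torch.bfloat16)

# Assumption: Zephyr-style chat markers were used during fine-tuning.
prompt = "<|user|>\nExplain state-space models in two sentences.</s>\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# Greedy decoding; sampling options differ across mamba_ssm versions.
out = model.generate(input_ids=input_ids, max_length=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```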
Implementation Details
The model was developed in two stages: the base model (mamba-2.8b-slimpj) was first instruction-tuned on the UltraChat 200k dataset, then preference-optimized with DPO on the UltraFeedback binarized dataset. Training ran across 8 GPUs with carefully tuned hyperparameters, including a learning rate of 5e-07 and a linear schedule with a 0.1 warmup ratio; the remaining settings are listed below (see the configuration sketch after the list).
- Trained for 3 epochs with a global batch size of 64
- Used the Adam optimizer with betas=(0.9, 0.999)
- Reached a final validation loss of 0.4996
- Achieved a reward margin of 1.1582 between chosen and rejected responses
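As a rough illustration, the hyperparameters above can be expressed as a trl DPOConfig. The model card does not say which training framework was used or how the Mamba backbone was wrapped for it, so treat this as a mapping of the reported settings rather than a reproduction of the original training script; the per-device batch split and precision are assumptions.

```python
# Hedged sketch: the reported DPO hyperparameters as a trl DPOConfig
# (recent trl versions; not the original training configuration).
from trl import DPOConfig

config = DPOConfig(
    output_dir="mamba-2.8b-zephyr-dpo",  # illustrative path
    learning_rate=5e-7,                  # reported learning rate
    lr_scheduler_type="linear",          # linear schedule
    warmup_ratio=0.1,                    # 0.1 warmup ratio
    num_train_epochs=3,                  # 3 epochs
    per_device_train_batch_size=8,       # assumption: 8 per GPU x 8 GPUs = global batch of 64
    adam_beta1=0.9,                      # Adam betas (0.9, 0.999)
    adam_beta2=0.999,
    bf16=True,                           # assumption: bf16 mixed precision
    # beta (the DPO temperature) is not reported; trl defaults to 0.1.
)
```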
Core Capabilities
- High accuracy (78.57%) in preference alignment (see the metrics sketch after this list)
- Clear separation of chosen from rejected responses (reward margin of 1.1582)
- Robust performance across varied input contexts
- Optimized for instruction following and preference learning
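The accuracy and reward-margin figures come from the implicit rewards defined by the DPO objective (Rafailov et al., 2023). The sketch below shows the standard formulation and how those metrics are typically computed; the variable names and the beta value are illustrative, not taken from the actual training code.

```python
# Standard DPO loss plus the accuracy / reward-margin metrics quoted above.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Inputs are summed log-probabilities of each response under the
    policy being trained and under the frozen reference (SFT) model."""
    # Implicit rewards: beta-scaled log-ratio between policy and reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # DPO objective: push the chosen reward above the rejected reward.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Accuracy: fraction of pairs where the chosen response wins.
    accuracy = (chosen_rewards > rejected_rewards).float().mean()
    # Reward margin: average gap between chosen and rejected rewards.
    margin = (chosen_rewards - rejected_rewards).mean()
    return loss, accuracy, margin
```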
Frequently Asked Questions
Q: What makes this model unique?
The model stands out for combining the Mamba architecture with DPO training, yielding strong preference alignment while retaining the efficient processing characteristics of state-space models.
Q: What are the recommended use cases?
While specific use cases aren't detailed in the model card, the model's strong preference alignment makes it suitable for tasks requiring nuanced understanding of user preferences and high-quality instruction following.