OLMo-2-1124-7B-DPO
| Property | Value |
|---|---|
| Base Model | OLMo-2-7B-SFT |
| License | Apache 2.0 |
| Language | English |
| Paper | Forthcoming |
| Training Method | Direct Preference Optimization (DPO) |
What is OLMo-2-1124-7B-DPO?
OLMo-2-1124-7B-DPO is a language model developed by the Allen Institute for AI (AI2) as part of its OLMo (Open Language Model) series. It is a DPO-trained variant of the OLMo 2 7B SFT model, optimized on preference data to improve its instruction-following ability and performance across diverse tasks.
Implementation Details
The model employs a length-normalized DPO training approach with carefully tuned hyperparameters: a learning rate of 8e-7, a beta of 5, and an effective batch size of 128. It uses a maximum sequence length of 2,048 tokens and a linear learning-rate schedule with a 0.1 warmup ratio; a minimal sketch of the objective appears below.
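As a rough illustration, the following PyTorch sketch shows the shape of a length-normalized DPO objective with the reported beta of 5. The function name and tensor arguments are hypothetical; AI2's actual training code may differ in detail.

```python
import torch
import torch.nn.functional as F

def length_normalized_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # summed token log-probs of chosen responses under the policy
    policy_rejected_logps: torch.Tensor,  # summed token log-probs of rejected responses under the policy
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    chosen_lengths: torch.Tensor,         # chosen response lengths in tokens
    rejected_lengths: torch.Tensor,       # rejected response lengths in tokens
    beta: float = 5.0,                    # beta reported for this model
) -> torch.Tensor:
    # Length-normalize the policy/reference log-ratios per sequence
    chosen_logratio = (policy_chosen_logps - ref_chosen_logps) / chosen_lengths
    rejected_logratio = (policy_rejected_logps - ref_rejected_logps) / rejected_lengths

    # Standard DPO objective applied to the normalized margins
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```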
- Built on the Transformers architecture
- Trained on an OLMo-specific variant of the Tülu 3 preference dataset
- Implements a standardized chat template for consistent interaction (see the usage sketch after this list)
- Supports both general text generation and specialized tasks
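A minimal usage sketch with Hugging Face Transformers is shown below, assuming the model is published under the allenai/OLMo-2-1124-7B-DPO repository id and a recent transformers release with OLMo 2 support; adjust as needed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-2-1124-7B-DPO"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Build a prompt using the model's built-in chat template
messages = [{"role": "user", "content": "Explain DPO in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```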
Core Capabilities
- Strong performance on mathematical reasoning (GSM8K: 82.4%)
- Robust safety measures (Safety score: 81.5%)
- Effective instruction following (AlpacaEval: 29.9%)
- Balanced performance across diverse tasks including DROP and MMLU
Frequently Asked Questions
Q: What makes this model unique?
The model stands out for being fully open-source while achieving performance competitive with other leading models in its size class. It is particularly notable for its balanced results across a variety of tasks and its transparent training process.
Q: What are the recommended use cases?
The model is well-suited for research and educational applications, particularly excelling in mathematical reasoning, instruction following, and general text generation tasks. It's designed to be a versatile tool while maintaining strong safety considerations.