# OLMo-2-1124-7B-RM
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Base Model | OLMo-2-1124-7B-SFT |
| Paper | Forthcoming |
| Training Data | Tülu 3 dataset & preference dataset |
## What is OLMo-2-1124-7B-RM?
OLMo-2-1124-7B-RM is a reward model from the Allen Institute for AI (Ai2), built on the OLMo-2-1124-7B-SFT checkpoint. It is trained to score the quality of AI-generated responses so those scores can guide reinforcement learning. Training combines an OLMo-specific variant of the Tülu 3 dataset with a custom preference dataset, and the model is used in particular to initialize value models during RLVR training.
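The card does not spell out the training objective, but reward models of this kind are conventionally trained with a pairwise Bradley-Terry loss over preference pairs. The following is a minimal sketch with illustrative values, not the confirmed OLMo recipe:

```python
# Minimal sketch of the pairwise (Bradley-Terry) reward-modeling objective:
# push the reward of the chosen response above that of the rejected one.
# Values below are illustrative, not from the OLMo training run.
import torch
import torch.nn.functional as F

def pairwise_rm_loss(chosen_rewards: torch.Tensor,
                     rejected_rewards: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

chosen = torch.tensor([1.2, 0.4, 2.0])    # rewards for preferred responses
rejected = torch.tensor([0.3, 0.5, 1.1])  # rewards for rejected responses
print(pairwise_rm_loss(chosen, rejected).item())
```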
## Implementation Details
The model was trained with a learning rate of 3e-6, an effective batch size of 256, and a maximum sequence length of 4096 tokens, for a single epoch without a specific learning-rate schedule. It uses a standardized chat template and can be loaded with Hugging Face's transformers library, provided the installed build includes OLMo 2 support; a loading sketch follows the list below.
- Requires installing transformers from a custom branch
- Supports sequence classification tasks
- Uses a standardized chat template format
- Compatible with standard system prompts
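As a minimal loading sketch, assuming a transformers build with OLMo 2 support and that the checkpoint exposes a single-logit sequence-classification head:

```python
# Minimal sketch: load the RM and score one conversation.
# Assumes a transformers build with OLMo 2 support.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "allenai/OLMo-2-1124-7B-RM"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, torch_dtype=torch.bfloat16
)
model.eval()

messages = [
    {"role": "user", "content": "Explain reward models in one sentence."},
    {"role": "assistant",
     "content": "A reward model assigns a scalar quality score to a response."},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")
with torch.no_grad():
    reward = model(input_ids).logits[0].item()  # scalar reward score
print(f"reward: {reward:.3f}")
```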
## Core Capabilities
- Reward modeling for AI response evaluation (see the best-of-n sketch after this list)
- Support for RLVR training initialization
- Sequence classification functionality
- Handling of complex dialogue interactions
- Integration with both 7B and 13B RLVR training pipelines
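As one concrete example of response evaluation, a reward model can rank several candidate completions for the same prompt (best-of-n selection). This hedged sketch reuses `model` and `tokenizer` from the loading example above; the prompt and candidates are illustrative:

```python
# Hypothetical best-of-n selection: score each candidate reply with the RM
# and keep the highest-scoring one. Reuses `model` and `tokenizer` from the
# loading sketch above; prompt and candidates are illustrative.
import torch

def score(messages) -> float:
    """Return the scalar reward for a full prompt/response conversation."""
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")
    with torch.no_grad():
        return model(input_ids).logits[0].item()

prompt = {"role": "user", "content": "Summarize the OLMo 2 release in one line."}
candidates = [
    "OLMo 2 is Ai2's fully open family of 7B and 13B language models.",
    "It is a model.",
]
best = max(
    candidates,
    key=lambda c: score([prompt, {"role": "assistant", "content": c}]),
)
print(best)
```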
## Frequently Asked Questions
Q: What makes this model unique?
This model is specifically designed as a reward model for the OLMo ecosystem, trained on a carefully curated mix of preference data. It serves as a crucial component in the training pipeline for both 7B and 13B RLVR models, making it essential for developing more capable instruction-following AI systems.
Q: What are the recommended use cases?
The primary use case is as an initialization point for value models during RLVR training. It's not intended for direct deployment in applications but rather serves as a component in the training pipeline for developing more sophisticated AI models.
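Concretely, "initialization" here means loading this checkpoint as the starting weights of the RLVR value model (critic). A minimal sketch; the exact trainer wiring depends on the RL library in use:

```python
# Hypothetical sketch: start an RLVR value model (critic) from this checkpoint.
# The surrounding PPO/RLVR trainer wiring is omitted and library-dependent.
from transformers import AutoModelForSequenceClassification

value_model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/OLMo-2-1124-7B-RM",
    num_labels=1,  # scalar value head, initialized from the reward head
)
# `value_model` is then handed to the RL trainer as the critic/value network.
```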