# OLMo-2-1124-7B-RM
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Base Model | OLMo-2-1124-7B-SFT |
| Paper | Forthcoming |
| Training Data | Tülu 3 dataset & preference dataset |
## What is OLMo-2-1124-7B-RM?
OLMo-2-1124-7B-RM is a reward model from the Allen Institute for AI (Ai2), built on the OLMo-2-1124-7B-SFT checkpoint. It is trained to score the quality of AI-generated responses so those scores can guide reinforcement learning. Training combines an OLMo-specific variant of the Tülu 3 dataset with a custom preference dataset, and the model is used in particular to initialize value models during RLVR training.
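The card does not spell out the training objective, but reward models of this kind are conventionally trained with a pairwise Bradley-Terry loss over preference pairs. The following is a minimal sketch with illustrative values, not the confirmed OLMo recipe:

```python
# Minimal sketch of the pairwise (Bradley-Terry) reward-modeling objective:
# push the reward of the chosen response above that of the rejected one.
# Values below are illustrative, not from the OLMo training run.
import torch
import torch.nn.functional as F

def pairwise_rm_loss(chosen_rewards: torch.Tensor,
                     rejected_rewards: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

chosen = torch.tensor([1.2, 0.4, 2.0])    # rewards for preferred responses
rejected = torch.tensor([0.3, 0.5, 1.1])  # rewards for rejected responses
print(pairwise_rm_loss(chosen, rejected).item())
```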
## Implementation Details
The model was trained with a learning rate of 3e-6, an effective batch size of 256, and a maximum sequence length of 4096 tokens, for a single epoch without a specific learning-rate schedule. It uses a standardized chat template and can be loaded with Hugging Face's transformers library, provided the installed build includes OLMo 2 support; a loading sketch follows the list below.
- Requires installing transformers from a custom branch
- Supports sequence classification tasks
- Uses a standardized chat template format
- Compatible with standard system prompts
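As a minimal loading sketch, assuming a transformers build with OLMo 2 support and that the checkpoint exposes a single-logit sequence-classification head:

```python
# Minimal sketch: load the RM and score one conversation.
# Assumes a transformers build with OLMo 2 support.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "allenai/OLMo-2-1124-7B-RM"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, torch_dtype=torch.bfloat16
)
model.eval()

messages = [
    {"role": "user", "content": "Explain reward models in one sentence."},
    {"role": "assistant",
     "content": "A reward model assigns a scalar quality score to a response."},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")
with torch.no_grad():
    reward = model(input_ids).logits[0].item()  # scalar reward score
print(f"reward: {reward:.3f}")
```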
## Core Capabilities
- Reward modeling for AI response evaluation (see the best-of-n sketch after this list)
- Support for RLVR training initialization
- Sequence classification functionality
- Handling of complex dialogue interactions
- Integration with both 7B and 13B RLVR training pipelines
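As one concrete example of response evaluation, a reward model can rank several candidate completions for the same prompt (best-of-n selection). This hedged sketch reuses `model` and `tokenizer` from the loading example above; the prompt and candidates are illustrative:

```python
# Hypothetical best-of-n selection: score each candidate reply with the RM
# and keep the highest-scoring one. Reuses `model` and `tokenizer` from the
# loading sketch above; prompt and candidates are illustrative.
import torch

def score(messages) -> float:
    """Return the scalar reward for a full prompt/response conversation."""
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")
    with torch.no_grad():
        return model(input_ids).logits[0].item()

prompt = {"role": "user", "content": "Summarize the OLMo 2 release in one line."}
candidates = [
    "OLMo 2 is Ai2's fully open family of 7B and 13B language models.",
    "It is a model.",
]
best = max(
    candidates,
    key=lambda c: score([prompt, {"role": "assistant", "content": c}]),
)
print(best)
```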
## Frequently Asked Questions
Q: What makes this model unique?
This model is specifically designed as a reward model for the OLMo ecosystem, trained on a carefully curated mix of preference data. It serves as a crucial component in the training pipeline for both 7B and 13B RLVR models, making it essential for developing more capable instruction-following AI systems.
Q: What are the recommended use cases?
The primary use case is as an initialization point for value models during RLVR training. It's not intended for direct deployment in applications but rather serves as a component in the training pipeline for developing more sophisticated AI models.
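Concretely, "initialization" here means loading this checkpoint as the starting weights of the RLVR value model (critic). A minimal sketch; the exact trainer wiring depends on the RL library in use:

```python
# Hypothetical sketch: start an RLVR value model (critic) from this checkpoint.
# The surrounding PPO/RLVR trainer wiring is omitted and library-dependent.
from transformers import AutoModelForSequenceClassification

value_model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/OLMo-2-1124-7B-RM",
    num_labels=1,  # scalar value head, initialized from the reward head
)
# `value_model` is then handed to the RL trainer as the critic/value network.
```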