Starling-RM-7B-alpha
| Property | Value |
|---|---|
| Base Model | Llama-2-7B-Chat |
| Training Dataset | berkeley-nest/Nectar |
| License | Apache-2.0 |
| Primary Use | Reward Model for RLHF |
What is Starling-RM-7B-alpha?
Starling-RM-7B-alpha is a reward model trained from Llama-2-7B-Chat, built for Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF). The final layer of Llama-2-7B-Chat is replaced with a linear layer that outputs a single scalar score for a given prompt-response pair, and the model is trained on the berkeley-nest/Nectar preference dataset.
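As a rough illustration of that design, the PyTorch sketch below shows a Llama backbone whose language-modeling head is swapped for a linear layer producing one scalar per sequence. The class, loading code, and last-token pooling here are illustrative assumptions, not the official Starling implementation; the model's own repository may ship its own loading utilities.

```python
# Minimal sketch of a scalar reward head on top of a Llama backbone.
# Names and pooling strategy are illustrative, not the official code.
import torch
import torch.nn as nn
from transformers import AutoModel


class ScalarRewardModel(nn.Module):
    def __init__(self, base_model_name: str):
        super().__init__()
        # Backbone without the LM head; we only need hidden states.
        self.backbone = AutoModel.from_pretrained(base_model_name)
        # Linear head mapping the final hidden state to a single reward score.
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state  # (batch, seq_len, hidden)
        # Index of the last non-padding token per sequence (assumes right padding).
        last_idx = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        last_hidden = hidden[batch_idx, last_idx]          # (batch, hidden)
        return self.reward_head(last_hidden).squeeze(-1)   # (batch,) scalar rewards
```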
Implementation Details
The training setup follows the InstructGPT-style reward-modeling recipe, using a K-wise maximum likelihood estimator (a sketch of one such K-wise objective appears after the list below). Given a prompt-response pair, the model produces a scalar reward score reflecting the response's helpfulness and harmlessness.
- Modified architecture with custom linear output layer
- Trained on preference rankings produced by GPT-4 (via the Nectar dataset)
- Maximum sequence length of 2048 tokens
- Efficient reward scoring mechanism
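One common way to instantiate a K-wise maximum likelihood objective is a Plackett-Luce ranking likelihood over the K ranked responses to each prompt. The sketch below is illustrative only; the function name, tensor layout, and normalization are assumptions, and this is not claimed to be the exact loss used to train Starling-RM-7B-alpha.

```python
# Illustrative K-wise ranking loss (Plackett-Luce style): given reward scores
# for K responses to the same prompt, ordered from most to least preferred,
# return the negative log-likelihood of that ranking.
import torch


def k_wise_ranking_loss(rewards_ranked: torch.Tensor) -> torch.Tensor:
    """rewards_ranked: (batch, K) reward scores, column 0 = most preferred."""
    batch, k = rewards_ranked.shape
    loss = rewards_ranked.new_zeros(())
    for i in range(k - 1):
        # Log-probability that the i-th ranked response outranks all lower-ranked ones.
        denom = torch.logsumexp(rewards_ranked[:, i:], dim=1)
        loss = loss - (rewards_ranked[:, i] - denom).mean()
    return loss / (k - 1)
```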
Core Capabilities
- Scalar reward generation for prompt-response pairs
- Preference-based response evaluation
- Integration with RLHF/RLAIF pipelines
- Support for batch processing with customizable batch sizes (see the usage sketch below)
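To make the batch-scoring capability concrete, here is a hedged usage sketch built on the ScalarRewardModel class defined earlier. The model identifier, prompt formatting, and batch size are placeholders and assumptions, not the documented interface of Starling-RM-7B-alpha.

```python
# Hypothetical batch-scoring loop using the ScalarRewardModel sketch above.
# Model name and prompt/response formatting are assumptions for illustration.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer.pad_token = tokenizer.eos_token   # Llama-2 has no pad token by default
tokenizer.padding_side = "right"            # matches the last-token pooling above
model = ScalarRewardModel("meta-llama/Llama-2-7b-chat-hf").eval()


def score_batch(pairs, batch_size=8, max_length=2048):
    """pairs: list of (prompt, response) tuples; returns one scalar per pair."""
    scores = []
    for start in range(0, len(pairs), batch_size):
        texts = [f"{p} {r}" for p, r in pairs[start:start + batch_size]]
        enc = tokenizer(texts, return_tensors="pt", padding=True,
                        truncation=True, max_length=max_length)
        with torch.no_grad():
            scores.extend(model(enc["input_ids"], enc["attention_mask"]).tolist())
    return scores


pairs = [
    ("How do I sort a list in Python?", "Use the built-in sorted() function."),
    ("How do I sort a list in Python?", "I don't know."),
]
print(score_batch(pairs))  # higher score = more preferred response
```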
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its training on GPT-4 preference rankings from the Nectar dataset, which makes it effective at evaluating responses for both helpfulness and harmlessness.
Q: What are the recommended use cases?
The model is primarily designed for training other language models through reinforcement learning, particularly in scenarios requiring alignment with human preferences and ethical considerations. It's especially useful in RLHF and RLAIF pipelines.