Starling-RM-7B-alpha
| Property | Value |
|---|---|
| Base Model | Llama-2-7B-Chat |
| Training Dataset | berkeley-nest/Nectar |
| License | Apache-2.0 |
| Primary Use | Reward Model for RLHF |
What is Starling-RM-7B-alpha?
Starling-RM-7B-alpha is a reward model trained from Llama-2-7B-Chat, built for Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF). The final layer of Llama-2-7B-Chat is replaced with a linear layer that outputs a single scalar score for a given prompt-response pair, and the model is trained on the berkeley-nest/Nectar preference dataset.
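As a rough illustration of that design, the PyTorch sketch below shows a Llama backbone whose language-modeling head is swapped for a linear layer producing one scalar per sequence. The class, loading code, and last-token pooling here are illustrative assumptions, not the official Starling implementation; the model's own repository may ship its own loading utilities.

```python
# Minimal sketch of a scalar reward head on top of a Llama backbone.
# Names and pooling strategy are illustrative, not the official code.
import torch
import torch.nn as nn
from transformers import AutoModel


class ScalarRewardModel(nn.Module):
    def __init__(self, base_model_name: str):
        super().__init__()
        # Backbone without the LM head; we only need hidden states.
        self.backbone = AutoModel.from_pretrained(base_model_name)
        # Linear head mapping the final hidden state to a single reward score.
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state  # (batch, seq_len, hidden)
        # Index of the last non-padding token per sequence (assumes right padding).
        last_idx = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        last_hidden = hidden[batch_idx, last_idx]          # (batch, hidden)
        return self.reward_head(last_hidden).squeeze(-1)   # (batch,) scalar rewards
```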
Implementation Details
The training setup follows the InstructGPT-style reward-modeling recipe, using a K-wise maximum likelihood estimator (a sketch of one such K-wise objective appears after the list below). Given a prompt-response pair, the model produces a scalar reward score reflecting the response's helpfulness and harmlessness.
- Modified architecture with custom linear output layer
- Trained on preference rankings produced by GPT-4 (via the Nectar dataset)
- Maximum sequence length of 2048 tokens
- Efficient reward scoring mechanism
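One common way to instantiate a K-wise maximum likelihood objective is a Plackett-Luce ranking likelihood over the K ranked responses to each prompt. The sketch below is illustrative only; the function name, tensor layout, and normalization are assumptions, and this is not claimed to be the exact loss used to train Starling-RM-7B-alpha.

```python
# Illustrative K-wise ranking loss (Plackett-Luce style): given reward scores
# for K responses to the same prompt, ordered from most to least preferred,
# return the negative log-likelihood of that ranking.
import torch


def k_wise_ranking_loss(rewards_ranked: torch.Tensor) -> torch.Tensor:
    """rewards_ranked: (batch, K) reward scores, column 0 = most preferred."""
    batch, k = rewards_ranked.shape
    loss = rewards_ranked.new_zeros(())
    for i in range(k - 1):
        # Log-probability that the i-th ranked response outranks all lower-ranked ones.
        denom = torch.logsumexp(rewards_ranked[:, i:], dim=1)
        loss = loss - (rewards_ranked[:, i] - denom).mean()
    return loss / (k - 1)
```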
Core Capabilities
- Scalar reward generation for prompt-response pairs
- Preference-based response evaluation
- Integration with RLHF/RLAIF pipelines
- Support for batch processing with customizable batch sizes (see the usage sketch below)
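To make the batch-scoring capability concrete, here is a hedged usage sketch built on the ScalarRewardModel class defined earlier. The model identifier, prompt formatting, and batch size are placeholders and assumptions, not the documented interface of Starling-RM-7B-alpha.

```python
# Hypothetical batch-scoring loop using the ScalarRewardModel sketch above.
# Model name and prompt/response formatting are assumptions for illustration.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer.pad_token = tokenizer.eos_token   # Llama-2 has no pad token by default
tokenizer.padding_side = "right"            # matches the last-token pooling above
model = ScalarRewardModel("meta-llama/Llama-2-7b-chat-hf").eval()


def score_batch(pairs, batch_size=8, max_length=2048):
    """pairs: list of (prompt, response) tuples; returns one scalar per pair."""
    scores = []
    for start in range(0, len(pairs), batch_size):
        texts = [f"{p} {r}" for p, r in pairs[start:start + batch_size]]
        enc = tokenizer(texts, return_tensors="pt", padding=True,
                        truncation=True, max_length=max_length)
        with torch.no_grad():
            scores.extend(model(enc["input_ids"], enc["attention_mask"]).tolist())
    return scores


pairs = [
    ("How do I sort a list in Python?", "Use the built-in sorted() function."),
    ("How do I sort a list in Python?", "I don't know."),
]
print(score_batch(pairs))  # higher score = more preferred response
```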
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its training on GPT-4 preference rankings from the Nectar dataset, which makes it effective at evaluating responses for both helpfulness and harmlessness.
Q: What are the recommended use cases?
The model is primarily designed for training other language models through reinforcement learning, particularly in scenarios requiring alignment with human preferences and ethical considerations. It's especially useful in RLHF and RLAIF pipelines.