OREAL-DeepSeek-R1-Distill-Qwen-7B
| Property | Value |
|---|---|
| Parameter Count | 7 Billion |
| Model Type | Mathematical Reasoning Model |
| Author | internlm |
| Paper | arXiv:2502.06781 |
| Model Link | Hugging Face |
What is OREAL-DeepSeek-R1-Distill-Qwen-7B?
OREAL-DeepSeek-R1-Distill-Qwen-7B is a state-of-the-art mathematical reasoning model built on the Outcome REwArd-based reinforcement Learning (OREAL) framework. It achieves a remarkable 94.0% pass@1 accuracy on MATH-500, matching the performance of previous 32B models while using significantly fewer parameters.
Implementation Details
The model implements a novel RL framework designed specifically for tasks with binary outcome rewards. It utilizes best-of-N (BoN) sampling for behavior cloning and incorporates an on-policy token-level reward model to identify key tokens in reasoning trajectories.
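To make the BoN idea concrete, here is a miniature sketch of how best-of-N sampling collects positive trajectories for behavior cloning under a binary outcome reward. This is a conceptual outline only: the function names (`generate_candidates`, `check_answer`) and the toy reward check are hypothetical stand-ins, not part of the OREAL codebase or the released model.

```python
# Conceptual best-of-N (BoN) sampling sketch for behavior cloning.
# All names here are hypothetical stand-ins for the policy model and
# the binary outcome reward described in the OREAL paper.
import random

def generate_candidates(problem: str, n: int) -> list[str]:
    # Stand-in for sampling n reasoning trajectories from the policy model.
    return [f"trajectory {i} for: {problem}" for i in range(n)]

def check_answer(trajectory: str, reference: str) -> bool:
    # Stand-in for the binary outcome reward: True if the final answer
    # matches the reference. Randomized here purely for illustration.
    return random.random() > 0.5

def bon_sample(problem: str, reference: str, n: int = 16) -> list[str]:
    """Keep only trajectories that earn a positive outcome reward."""
    candidates = generate_candidates(problem, n)
    return [t for t in candidates if check_answer(t, reference)]

positives = bon_sample("Compute 3 + 4 * 2.", "11")
print(f"{len(positives)} of 16 sampled trajectories earned a positive reward")
```

In the actual framework, the retained positive trajectories serve as behavior-cloning targets, while the token-level reward model assigns credit to key tokens within each trajectory.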
- Advanced reward reshaping mechanism for negative samples
- Specialized system prompt for mathematical reasoning
- Integration with existing chat templates for easy deployment (see the usage sketch after this list)
- Support for multiple mathematical benchmarks
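As a concrete starting point, here is a minimal inference sketch using the standard Hugging Face `transformers` chat-template API. The system prompt shown is a generic placeholder, not the model's actual specialized prompt; consult the model card on Hugging Face for the exact text the model ships with.

```python
# Minimal inference sketch via the standard transformers chat-template API.
# The system prompt below is a generic placeholder; replace it with the
# specialized mathematical-reasoning prompt from the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "internlm/OREAL-DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": "You are a careful mathematical reasoner."},
    {"role": "user", "content": "Find all real x with x^2 - 5x + 6 = 0."},
]
# apply_chat_template formats the conversation with the model's template
# and returns input token ids ready for generation.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```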
Core Capabilities
- 94.0% accuracy on MATH-500 benchmark
- 50.0% accuracy on AIME tests
- 65.6% performance on LiveMathBench
- 66.1% accuracy on OlympiadBench
- Systematic approach to mathematical problem-solving
Frequently Asked Questions
Q: What makes this model unique?
The model's unique strength lies in its OREAL framework, which enables it to achieve 32B-model-level performance with only 7B parameters. It also features a sophisticated system prompt that guides systematic mathematical thinking and rigorous reasoning.
Q: What are the recommended use cases?
The model excels at mathematical competition problems, complex mathematical reasoning tasks, and educational applications requiring detailed step-by-step problem solving. It is particularly effective for tasks that demand deep mathematical understanding and a systematic approach to problem-solving.