OREAL-DeepSeek-R1-Distill-Qwen-7B
| Property | Value |
|---|---|
| Parameter Count | 7 Billion |
| Model Type | Mathematical Reasoning Model |
| Author | internlm |
| Paper | arXiv:2502.06781 |
| Model Link | Hugging Face |
What is OREAL-DeepSeek-R1-Distill-Qwen-7B?
OREAL-DeepSeek-R1-Distill-Qwen-7B is a state-of-the-art mathematical reasoning model built on the Outcome REwArd-based reinforcement Learning (OREAL) framework. It achieves a remarkable 94.0% pass@1 accuracy on MATH-500, matching the performance of previous 32B models while using significantly fewer parameters.
Implementation Details
The model implements a novel RL framework designed specifically for tasks with binary outcome rewards. It utilizes best-of-N (BoN) sampling for behavior cloning and incorporates an on-policy token-level reward model to identify key tokens in reasoning trajectories.
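To make the BoN idea concrete, here is a miniature sketch of how best-of-N sampling collects positive trajectories for behavior cloning under a binary outcome reward. This is a conceptual outline only: the function names (`generate_candidates`, `check_answer`) and the toy reward check are hypothetical stand-ins, not part of the OREAL codebase or the released model.

```python
# Conceptual best-of-N (BoN) sampling sketch for behavior cloning.
# All names here are hypothetical stand-ins for the policy model and
# the binary outcome reward described in the OREAL paper.
import random

def generate_candidates(problem: str, n: int) -> list[str]:
    # Stand-in for sampling n reasoning trajectories from the policy model.
    return [f"trajectory {i} for: {problem}" for i in range(n)]

def check_answer(trajectory: str, reference: str) -> bool:
    # Stand-in for the binary outcome reward: True if the final answer
    # matches the reference. Randomized here purely for illustration.
    return random.random() > 0.5

def bon_sample(problem: str, reference: str, n: int = 16) -> list[str]:
    """Keep only trajectories that earn a positive outcome reward."""
    candidates = generate_candidates(problem, n)
    return [t for t in candidates if check_answer(t, reference)]

positives = bon_sample("Compute 3 + 4 * 2.", "11")
print(f"{len(positives)} of 16 sampled trajectories earned a positive reward")
```

In the actual framework, the retained positive trajectories serve as behavior-cloning targets, while the token-level reward model assigns credit to key tokens within each trajectory.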
- Advanced reward reshaping mechanism for negative samples
- Specialized system prompt for mathematical reasoning
- Integration with existing chat templates for easy deployment (see the usage sketch after this list)
- Support for multiple mathematical benchmarks
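As a concrete starting point, here is a minimal inference sketch using the standard Hugging Face `transformers` chat-template API. The system prompt shown is a generic placeholder, not the model's actual specialized prompt; consult the model card on Hugging Face for the exact text the model ships with.

```python
# Minimal inference sketch via the standard transformers chat-template API.
# The system prompt below is a generic placeholder; replace it with the
# specialized mathematical-reasoning prompt from the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "internlm/OREAL-DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": "You are a careful mathematical reasoner."},
    {"role": "user", "content": "Find all real x with x^2 - 5x + 6 = 0."},
]
# apply_chat_template formats the conversation with the model's template
# and returns input token ids ready for generation.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```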
Core Capabilities
- 94.0% accuracy on MATH-500 benchmark
- 50.0% accuracy on AIME tests
- 65.6% performance on LiveMathBench
- 66.1% accuracy on OlympiadBench
- Systematic approach to mathematical problem-solving
Frequently Asked Questions
Q: What makes this model unique?
The model's unique strength lies in its OREAL framework, which enables it to achieve 32B-model-level performance with only 7B parameters. It also features a sophisticated system prompt that guides systematic mathematical thinking and rigorous reasoning.
Q: What are the recommended use cases?
The model excels at mathematical competition problems, complex mathematical reasoning tasks, and educational applications requiring detailed step-by-step problem solving. It is particularly effective for tasks that demand deep mathematical understanding and a systematic approach to problem-solving.