# r1-aqa
| Property | Value |
|---|---|
| Base Model | Qwen2-Audio-7B-Instruct |
| Training Method | Group Relative Policy Optimization (GRPO) |
| Model URL | https://huggingface.co/mispeech/r1-aqa |
| Author | mispeech |
## What is r1-aqa?
r1-aqa is an Audio Question Answering (AQA) model built on the Qwen2-Audio-7B-Instruct architecture. What makes it distinctive is its optimization through reinforcement learning with the Group Relative Policy Optimization (GRPO) algorithm. The model achieves state-of-the-art results on the MMAU Test-mini benchmark across sound, music, and speech understanding tasks, using only 38k post-training samples.
## Implementation Details
The model is served through the Transformers library and processes audio inputs alongside text prompts. It uses bfloat16 precision for efficient inference and supports automatic device mapping. Inputs are prepared via the processor's chat template, which combines the audio with the text question into a structured prompt.
- Built on Qwen2-Audio-7B-Instruct architecture
- Optimized using GRPO reinforcement learning
- Supports 16kHz audio input
- Implements structured output formatting with answer tags
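The points above can be sketched as a minimal inference helper. This is an illustrative sketch, not official usage: the model class and processor come from the Transformers library, but keyword names such as `audios=` differ between transformers releases, and the helper function name is ours.

```python
# Hypothetical sketch of running r1-aqa on a local audio file.
# Check the transformers version you have installed before relying
# on the exact processor keyword names used here.

def answer_audio_question(audio_path: str, question: str) -> str:
    # Heavy imports are kept inside the function so this sketch can be
    # read (and the function defined) without the dependencies installed.
    import torch
    import librosa
    from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

    model_id = "mispeech/r1-aqa"
    processor = AutoProcessor.from_pretrained(model_id)
    model = Qwen2AudioForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,   # efficient inference precision
        device_map="auto",            # automatic device placement
    )

    # The model expects 16 kHz audio input.
    audio, _ = librosa.load(audio_path, sr=16000)

    conversation = [
        {"role": "user", "content": [
            {"type": "audio", "audio_url": audio_path},
            {"type": "text", "text": question},
        ]}
    ]
    text = processor.apply_chat_template(
        conversation, add_generation_prompt=True, tokenize=False
    )
    inputs = processor(
        text=text, audios=[audio], sampling_rate=16000, return_tensors="pt"
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=256)
    # Strip the prompt tokens before decoding the answer.
    output_ids = output_ids[:, inputs.input_ids.shape[1]:]
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]
```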
## Core Capabilities
- Sound understanding: 69.37% accuracy
- Music analysis: 66.77% accuracy
- Speech comprehension: 57.36% accuracy
- Overall average performance: 64.50% accuracy on MMAU Test-mini
- Outperforms direct inference methods and other state-of-the-art models
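The overall figure is simply the unweighted mean of the three per-category accuracies, which is easy to verify:

```python
# Per-category accuracies on MMAU Test-mini (from the list above).
sound, music, speech = 69.37, 66.77, 57.36

# The reported overall score is the unweighted mean of the three.
overall = round((sound + music + speech) / 3, 2)
print(overall)  # 64.5
```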
## Frequently Asked Questions
Q: What makes this model unique?
This model stands out due to its use of reinforcement learning (GRPO) for optimization, which has proven more effective than traditional supervised fine-tuning approaches. It achieves state-of-the-art performance on the MMAU Test-mini benchmark with minimal post-training samples.
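The group-relative idea behind GRPO can be sketched in a few lines: sample a group of responses per question, score each one, and normalize the rewards within the group instead of training a separate value network. The reward values below are hypothetical, and this is only the advantage-computation step, not a full training loop.

```python
# Minimal sketch of the group-relative advantage at the core of GRPO.
# Rewards are normalized within each sampled group, so no learned
# critic is needed (unlike PPO). Reward values are hypothetical.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one audio question, scored 1 if correct.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
# Correct answers get positive advantages, incorrect ones negative.
```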
Q: What are the recommended use cases?
The model is particularly well-suited for audio question answering tasks across various domains including sound event recognition, music analysis, and speech comprehension. It can process audio inputs and provide structured responses based on specific questions about the audio content.