# r1-aqa
| Property | Value |
|---|---|
| Base Model | Qwen2-Audio-7B-Instruct |
| Training Method | Group Relative Policy Optimization (GRPO) |
| Model URL | https://huggingface.co/mispeech/r1-aqa |
| Author | mispeech |
## What is r1-aqa?
r1-aqa is an Audio Question Answering (AQA) model built on the Qwen2-Audio-7B-Instruct architecture. What makes it distinctive is its optimization through reinforcement learning with the Group Relative Policy Optimization (GRPO) algorithm. The model achieves state-of-the-art results on the MMAU Test-mini benchmark across sound, music, and speech understanding tasks, using only 38k post-training samples.
## Implementation Details
The model is served through the Transformers library and processes audio inputs alongside text prompts. It uses bfloat16 precision for efficient inference and supports automatic device mapping. Inputs are prepared via the processor's chat template, which combines the audio with the text question into a structured prompt.
- Built on Qwen2-Audio-7B-Instruct architecture
- Optimized using GRPO reinforcement learning
- Supports 16kHz audio input
- Implements structured output formatting with answer tags
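The points above can be sketched as a minimal inference helper. This is an illustrative sketch, not official usage: the model class and processor come from the Transformers library, but keyword names such as `audios=` differ between transformers releases, and the helper function name is ours.

```python
# Hypothetical sketch of running r1-aqa on a local audio file.
# Check the transformers version you have installed before relying
# on the exact processor keyword names used here.

def answer_audio_question(audio_path: str, question: str) -> str:
    # Heavy imports are kept inside the function so this sketch can be
    # read (and the function defined) without the dependencies installed.
    import torch
    import librosa
    from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

    model_id = "mispeech/r1-aqa"
    processor = AutoProcessor.from_pretrained(model_id)
    model = Qwen2AudioForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,   # efficient inference precision
        device_map="auto",            # automatic device placement
    )

    # The model expects 16 kHz audio input.
    audio, _ = librosa.load(audio_path, sr=16000)

    conversation = [
        {"role": "user", "content": [
            {"type": "audio", "audio_url": audio_path},
            {"type": "text", "text": question},
        ]}
    ]
    text = processor.apply_chat_template(
        conversation, add_generation_prompt=True, tokenize=False
    )
    inputs = processor(
        text=text, audios=[audio], sampling_rate=16000, return_tensors="pt"
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=256)
    # Strip the prompt tokens before decoding the answer.
    output_ids = output_ids[:, inputs.input_ids.shape[1]:]
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]
```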
## Core Capabilities
- Sound understanding: 69.37% accuracy
- Music analysis: 66.77% accuracy
- Speech comprehension: 57.36% accuracy
- Overall average performance: 64.50% accuracy on MMAU Test-mini
- Outperforms direct inference methods and other state-of-the-art models
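The overall figure is simply the unweighted mean of the three per-category accuracies, which is easy to verify:

```python
# Per-category accuracies on MMAU Test-mini (from the list above).
sound, music, speech = 69.37, 66.77, 57.36

# The reported overall score is the unweighted mean of the three.
overall = round((sound + music + speech) / 3, 2)
print(overall)  # 64.5
```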
## Frequently Asked Questions
Q: What makes this model unique?
This model stands out due to its use of reinforcement learning (GRPO) for optimization, which has proven more effective than traditional supervised fine-tuning approaches. It achieves state-of-the-art performance on the MMAU Test-mini benchmark with minimal post-training samples.
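The group-relative idea behind GRPO can be sketched in a few lines: sample a group of responses per question, score each one, and normalize the rewards within the group instead of training a separate value network. The reward values below are hypothetical, and this is only the advantage-computation step, not a full training loop.

```python
# Minimal sketch of the group-relative advantage at the core of GRPO.
# Rewards are normalized within each sampled group, so no learned
# critic is needed (unlike PPO). Reward values are hypothetical.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one audio question, scored 1 if correct.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
# Correct answers get positive advantages, incorrect ones negative.
```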
Q: What are the recommended use cases?
The model is particularly well-suited for audio question answering tasks across various domains including sound event recognition, music analysis, and speech comprehension. It can process audio inputs and provide structured responses based on specific questions about the audio content.