Recently, reinforcement learning (RL) has been shown to greatly enhance the reasoning capabilities of large language models (LLMs), and RL-based approaches have been progressively applied to visual multimodal tasks. However, the audio modality has largely been overlooked in these developments. Thus, we conduct a series of RL explorations in audio understanding and reasoning, focusing specifically on the audio question answering (AQA) task. We apply the group relative policy optimization (GRPO) algorithm to Qwen2-Audio-7B-Instruct, and our experiments demonstrate state-of-the-art performance on the MMAU Test-mini benchmark, achieving an accuracy of 64.5%. The main findings of this technical report are as follows: 1) The GRPO algorithm can be applied effectively to large audio language models (LALMs), even when the model has only 8.2B parameters; 2) With only 38k post-training samples, RL significantly outperforms supervised fine-tuning (SFT), indicating that RL-based approaches can be effective without large datasets; 3) The explicit reasoning process has not shown significant benefits for AQA tasks, and how to efficiently utilize deep thinking remains an open question for further research; 4) LALMs still lag far behind humans in auditory-language reasoning, suggesting that RL-based approaches warrant further exploration. Our project is available at https://github.com/xiaomi-research/r1-aqa and https://huggingface.co/mispeech/r1-aqa.
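To make the GRPO setup concrete, the following is a minimal sketch of the group-relative advantage computation at the core of GRPO, assuming a rule-based 0/1 correctness reward for multiple-choice AQA; the function name, tensor shapes, and reward design are illustrative assumptions, not the exact configuration used in this report.

```python
# Minimal sketch of GRPO's group-relative advantage computation.
# Assumption: each AQA question gets G sampled answers scored by a
# rule-based reward (1.0 if the answer matches the reference choice, else 0.0).
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within each group of G sampled responses.

    rewards: shape (num_questions, G); returns advantages of the same shape,
    computed as (r - mean) / (std + eps) per question.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)


# Example: 2 questions, 4 sampled answers each, 0/1 correctness rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```

In GRPO, these normalized advantages weight the token log-probabilities of each sampled response in a PPO-style clipped objective, removing the need for a separate value (critic) model.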