Recent advancements in large language models, multimodal large language models, and large audio language models (LALMs) have significantly improved their reasoning capabilities through reinforcement learning with rule-based rewards. However, explicit reasoning has yet to show significant benefits for audio question answering, effectively leveraging deep reasoning remains an open challenge, and LALMs still fall short of human-level auditory-language reasoning. To address these limitations, we propose Audio-Thinker, a reinforcement learning framework designed to enhance the reasoning capabilities of LALMs, with a focus on improving adaptability, consistency, and effectiveness. Our approach introduces an adaptive think accuracy reward, enabling the model to dynamically adjust its reasoning strategies based on task complexity. Furthermore, we incorporate an external reward model to evaluate the overall consistency and quality of the reasoning process, complemented by think-based rewards that help the model distinguish between valid and flawed reasoning paths during training. Experimental results demonstrate that our Audio-Thinker model outperforms existing reasoning-oriented LALMs across various benchmark tasks, exhibiting superior reasoning and generalization capabilities.
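To make the reward design concrete, the following is a minimal, hypothetical Python sketch of how the components named in the abstract could compose: an adaptive think accuracy reward that conditions on task difficulty, and a consistency reward derived from an external reward model. Every function name, signal, and weight here is an illustrative assumption, not the paper's actual implementation.

```python
# Hypothetical sketch of the composite reward described in the abstract.
# All names (adaptive_think_accuracy_reward, consistency_reward, weights)
# are illustrative assumptions, not the paper's implementation.

def adaptive_think_accuracy_reward(answer_correct: bool,
                                   used_thinking: bool,
                                   task_is_hard: bool) -> float:
    """Reward correct answers; encourage explicit reasoning only on hard tasks."""
    base = 1.0 if answer_correct else 0.0
    if task_is_hard:
        # Hard tasks: bonus for engaging the explicit reasoning (<think>) process.
        return base + (0.5 if used_thinking else 0.0)
    # Easy tasks: small penalty for spending unnecessary reasoning tokens.
    return base - (0.2 if used_thinking else 0.0)


def consistency_reward(reasoning_score: float,
                       answer_matches_reasoning: bool) -> float:
    """Scalar score from an external reward model, gated on whether the
    final answer is consistent with the reasoning trace."""
    return reasoning_score if answer_matches_reasoning else 0.0


def total_reward(answer_correct: bool, used_thinking: bool, task_is_hard: bool,
                 reasoning_score: float, answer_matches_reasoning: bool,
                 w_acc: float = 1.0, w_cons: float = 0.5) -> float:
    """Weighted combination of the two reward components."""
    return (w_acc * adaptive_think_accuracy_reward(
                answer_correct, used_thinking, task_is_hard)
            + w_cons * consistency_reward(
                reasoning_score, answer_matches_reasoning))


if __name__ == "__main__":
    # A hard question answered correctly with explicit reasoning that the
    # external reward model scores 0.8 and that supports the final answer:
    # 1.0 * (1.0 + 0.5) + 0.5 * 0.8 = 1.9
    print(total_reward(True, True, True, 0.8, True))
```

Under this reading, a rollout that reasons and answers correctly on a hard item earns the full bonus, while a flawed or answer-inconsistent reasoning path forfeits the consistency term, which is one plausible way the think-based rewards could separate valid from flawed reasoning during training.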