Although large audio language models (LALMs) show promise on audio question answering, on complex audio reasoning they fail to focus on question-relevant audio segments and to provide a clear, checkable reasoning process. Reinforcement learning and tool-augmented prompting help models relate questions to audio, but they lack a reliable way to understand, integrate, and self-verify audio segments. To address this gap, we present EChO-Agent, a modular agent framework that reformulates complex audio QA as a workflow of planning, tool execution, evidence integration, and answer verification. Experiments on the MMAR benchmark show that EChO-Agent improves both accuracy and rubric scores over the baseline, and ablation studies show that evidence integration is the key factor.
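To make the four-stage workflow concrete, the following is a minimal Python sketch of how planning, tool execution, evidence integration, and answer verification might be chained. It is an illustration under stated assumptions, not the EChO-Agent implementation: every name here (`Evidence`, `plan`, `execute`, `integrate`, `verify`, `answer_question`, and the tool names `transcribe` and `detect_events`) is hypothetical, the plan is hard-coded rather than produced by a model, and the verification step is a toy string check.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Segment = Tuple[float, float]          # (start_seconds, end_seconds)
Tool = Callable[[str, Segment], str]   # (audio_path, segment) -> textual finding


@dataclass
class Evidence:
    tool: str
    segment: Segment
    finding: str


def plan(question: str) -> List[dict]:
    """Hypothetical planner: returns a fixed plan for illustration only."""
    return [
        {"tool": "transcribe", "segment": (0.0, 5.0)},
        {"tool": "detect_events", "segment": (5.0, 10.0)},
    ]


def execute(steps: List[dict], tools: Dict[str, Tool], audio_path: str) -> List[Evidence]:
    """Run each planned tool on its audio segment and record the finding."""
    evidence = []
    for step in steps:
        finding = tools[step["tool"]](audio_path, step["segment"])
        evidence.append(Evidence(step["tool"], step["segment"], finding))
    return evidence


def integrate(evidence: List[Evidence]) -> str:
    """Merge per-segment findings into one time-stamped evidence summary."""
    return "\n".join(
        f"[{e.segment[0]:.1f}-{e.segment[1]:.1f}s] {e.tool}: {e.finding}"
        for e in evidence
    )


def verify(answer: str, evidence: List[Evidence]) -> bool:
    """Toy check (assumption): the answer must appear in at least one finding."""
    return any(answer.lower() in e.finding.lower() for e in evidence)


def answer_question(question: str, audio_path: str, tools: Dict[str, Tool],
                    propose_answer: Callable[[str, str], str]) -> dict:
    """Plan -> execute tools -> integrate evidence -> propose and verify an answer."""
    evidence = execute(plan(question), tools, audio_path)
    context = integrate(evidence)
    answer = propose_answer(question, context)
    return {"answer": answer, "verified": verify(answer, evidence), "evidence": context}
```

In this sketch the tools and the answer proposer are passed in as plain callables, so stubs can stand in for real audio models during testing. The returned `evidence` string is what makes the reasoning checkable: each finding is tied to the tool and the time span that produced it.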