Large Audio-Language Models (LALMs) perform well on audio understanding tasks but lack the multi-step reasoning and tool-calling capabilities found in recent Large Language Models (LLMs). This paper presents AudioToolAgent, a framework that coordinates audio-language models as tools: a central LLM agent accesses tool adapters for audio question answering and speech-to-text, reasons about which tools to invoke, formulates follow-up queries, and arbitrates conflicting tool outputs, all without direct access to the audio. Experiments on MMAU, MMAR, and MMAU-Pro show state-of-the-art accuracy: up to 77.50% on MMAU, 77.00% on MMAR, and 61.90% on MMAU-Pro. A Shapley-value analysis identifies effective agent-tool combinations. Code and reproduction materials are available at https://github.com/GLJS/AudioToolAgent.
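The agent-tool coordination described above can be sketched as a text-only dispatch loop. This is a minimal illustrative sketch, not the paper's implementation: the adapter names, the stub tools, and the majority-vote arbitration rule are all assumptions for demonstration.

```python
# Hypothetical sketch of a central agent coordinating tool adapters.
# Names, stubs, and the majority-vote arbitration are illustrative only.
from collections import Counter
from typing import Callable, Dict

# A tool adapter wraps an audio model behind a text-in/text-out interface.
ToolAdapter = Callable[[str], str]

def run_agent(question: str, adapters: Dict[str, ToolAdapter]) -> str:
    """Query each adapter and arbitrate conflicting answers.

    The central agent operates on text alone (the question and the
    adapters' answers); it never touches the raw audio.
    """
    answers = {name: tool(question) for name, tool in adapters.items()}
    # Simple arbitration strategy: pick the most common answer across tools.
    winner, _ = Counter(answers.values()).most_common(1)[0]
    return winner

# Stub adapters standing in for real audio-QA and speech-to-text models.
adapters = {
    "audio_qa_a": lambda q: "dog barking",
    "audio_qa_b": lambda q: "dog barking",
    "asr": lambda q: "door slamming",
}
print(run_agent("What sound is in the clip?", adapters))  # -> dog barking
```

In the actual framework, arbitration is performed by the LLM agent's own reasoning over the tool outputs rather than a fixed vote; the vote here only stands in for that step.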