Large Audio Language Models (LALMs) have demonstrated strong capabilities in audio understanding and reasoning. However, their performance on fine-grained auditory perception remains unreliable, and existing approaches largely rely on data-intensive training to internalize perceptual abilities. We propose AudioRouter, a reinforcement learning framework that enables LALMs to improve audio understanding by learning when and how to use external audio tools. Rather than tightly coupling tool usage with audio reasoning, AudioRouter formulates tool use as an explicit decision-making problem and optimizes a lightweight routing policy while keeping the underlying reasoning model frozen. Experimental results show that AudioRouter achieves substantial improvements on standard audio understanding benchmarks while requiring up to 600× less training data to learn tool usage than conventional training paradigms. These findings suggest that learning effective tool usage offers a data-efficient and scalable alternative to internalizing perceptual abilities in LALMs.
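To make the core idea concrete, the following is a minimal toy sketch of training a lightweight routing policy with a frozen underlying model. Everything here is invented for illustration (the reward model, the difficulty feature, the single logistic-unit policy, and all hyperparameters are assumptions, not details from the paper): the reasoner handles easy queries well on its own, the external tool gives a fixed payoff with a small usage cost, and only the router's two parameters are updated via REINFORCE.

```python
import math
import random

random.seed(0)

# Hypothetical toy environment: each query has a difficulty in [0, 1].
# Answering with the frozen reasoner alone earns reward (1 - difficulty);
# calling the external tool earns a fixed 0.8 (accurate, but with a cost).
def reward(difficulty: float, use_tool: bool) -> float:
    return 0.8 if use_tool else 1.0 - difficulty

# Lightweight routing policy: one logistic unit P(call tool | difficulty).
# The reasoning model itself is never updated; only (w, b) are trained.
w, b = 0.0, 0.0
lr = 0.5

def p_tool(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

for _ in range(5000):
    x = random.random()                 # sample a query difficulty
    p = p_tool(x)
    a = 1 if random.random() < p else 0  # 1 = call the tool
    r = reward(x, a == 1)
    # REINFORCE for a Bernoulli policy: grad log pi = (a - p) * d(logit)/d(theta)
    w += lr * r * (a - p) * x
    b += lr * r * (a - p)

# After training, the router prefers the tool on hard queries only.
print(round(p_tool(0.1), 2), round(p_tool(0.9), 2))
```

Because the routing decision is a two-parameter bandit problem rather than a perception task, a handful of reward signals suffices to learn it, which is the intuition behind the data-efficiency claim above.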