Large audio language models are increasingly used for complex audio understanding tasks, but they struggle with tasks that require precise temporal grounding, such as word alignment and speaker diarization. The standard approach, generating timestamps as sequences of text tokens, is computationally expensive and prone to hallucination, especially on audio lengths outside the model's training distribution. In this work, we propose frame-level internal tool use, a method that trains audio language models to perform temporal grounding directly from their own internal audio representations. We introduce a lightweight prediction mechanism trained via two objectives: a binary frame classifier and a novel inhomogeneous Poisson process (IHP) loss that models temporal event intensity. Across word localization, speaker diarization, and event localization tasks, our approach outperforms token-based baselines. Most notably, it achieves a >50x inference speedup and demonstrates robust length generalization, maintaining high accuracy on out-of-distribution audio durations where standard token-based models collapse completely.
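To make the two objectives concrete, here is a minimal PyTorch sketch of what frame-level training losses of this kind could look like. This is an illustration under assumptions, not the paper's implementation: the function name, tensor shapes, and frame duration are hypothetical, and the IHP term uses the standard Poisson-process negative log-likelihood, NLL = \int_0^T \lambda(t)\,dt - \sum_i \log \lambda(t_i), with the integral approximated by a Riemann sum over frames.

```python
import torch
import torch.nn.functional as F

def frame_grounding_losses(frame_logits, log_intensity, frame_labels,
                           event_frames, frame_dur=0.02):
    """Hypothetical sketch of two frame-level objectives.

    frame_logits:  (T,) logits for the binary frame classifier
    log_intensity: (T,) per-frame log intensity log lambda(t) for the IHP loss
    frame_labels:  (T,) 0/1 targets (is the event active in this frame?)
    event_frames:  (K,) integer frame indices of ground-truth event times
    frame_dur:     assumed frame duration in seconds (integration step)
    """
    # Objective 1: binary cross-entropy on every frame.
    bce = F.binary_cross_entropy_with_logits(frame_logits, frame_labels.float())

    # Objective 2: inhomogeneous Poisson process NLL.
    # Integral term: approximate \int_0^T lambda(t) dt by summing over frames.
    integral = log_intensity.exp().sum() * frame_dur
    # Event term: log-intensity evaluated at each observed event frame.
    event_term = log_intensity[event_frames].sum()
    ihp_nll = integral - event_term

    return bce, ihp_nll
```

At inference, a head of this shape can read out timestamps in a single forward pass (e.g., thresholding the classifier or peak-picking the intensity), which is consistent with the speedup over autoregressive token-by-token timestamp decoding reported above.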