Large Audio-Language Models (ALMs) have recently demonstrated remarkable capabilities in holistic audio understanding, yet they remain unreliable for temporal grounding, i.e., the task of pinpointing exactly when an event occurs within long-form audio. This limitation stems from two factors: training data dominated by clip-level supervision that lacks precise timestamps, and benchmarks that fail to simulate real-world scenarios where short events are obscured by dense background sounds. In this paper, we introduce SpotSound, an audio-language model designed for grounding audio events. SpotSound incorporates a novel training objective designed to suppress hallucinated timestamps for events absent from the input. Additionally, we present SpotSound-Bench, a challenging temporal grounding benchmark where target events occupy less than roughly 10\% of each clip, creating a rigorous `needle-in-a-haystack' evaluation. Experiments demonstrate that SpotSound achieves state-of-the-art results on temporal grounding benchmarks while maintaining robust performance across general downstream audio-language tasks. Code, models, and benchmark are released at https://loiesun.github.io/spotsound/