Large Audio Language Models (LALMs) are increasingly applied to audio understanding and multimodal reasoning, yet their ability to locate when events occur remains underexplored. We present the first systematic study of temporal bias in LALMs, revealing a key limitation in their timestamp prediction. For example, when asked "At which second does the lecturer introduce the key formula?", models often predict timestamps that are consistently earlier or later than the ground truth. Through controlled experiments on timestamped datasets, we find that temporal bias (i) is prevalent across datasets and models, (ii) increases with audio length, accumulating to tens of seconds in extended recordings, and (iii) varies across event types and positions. We quantify this effect with the Temporal Bias Index (TBI), which measures systematic misalignment in predicted event timings, and complement it with a visualization framework. Our findings highlight a fundamental limitation in current LALMs and call for the development of temporally robust architectures.
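The abstract does not give the TBI formula, but a natural reading of "systematic misalignment" is a mean signed timestamp error, where the sign distinguishes early from late predictions. The sketch below illustrates that hypothetical form; the function name and definition are assumptions, not the paper's published metric.

```python
# Hedged sketch of a Temporal Bias Index as mean signed error (seconds).
# Assumption: TBI = average of (predicted - ground_truth); the paper's
# exact definition is not stated in this abstract.

def temporal_bias_index(predicted, ground_truth):
    """Mean signed timestamp error: negative = early, positive = late."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need equal-length, non-empty timestamp lists")
    return sum(p - g for p, g in zip(predicted, ground_truth)) / len(predicted)

# Toy example: a model that predicts each event about 2 s early.
pred = [10.0, 41.5, 88.0]
gold = [12.0, 43.0, 90.5]
print(temporal_bias_index(pred, gold))  # prints -2.0 (systematically early)
```

Unlike a mean absolute error, the signed average preserves direction, so a consistent early or late drift does not cancel out across events.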