Large Audio-Language Models (LALMs) have recently shown impressive progress in speech recognition, audio captioning, and auditory question answering. Yet whether these models can perceive spatial dynamics, particularly the motion of sound sources, remains unclear. In this work, we uncover a systematic motion perception deficit in current LALMs. To investigate this issue, we introduce AudioMotionBench, the first benchmark explicitly designed to evaluate auditory motion understanding: a controlled question-answering benchmark that tests whether LALMs can infer the direction and trajectory of moving sound sources from binaural audio. Comprehensive quantitative and qualitative analyses reveal that current models struggle to reliably recognize motion cues or distinguish directional patterns, with average accuracy remaining below 50\%, underscoring a fundamental limitation in auditory spatial reasoning. Our study highlights this gap between human and model auditory spatial reasoning, providing both a diagnostic tool and new insights for enhancing spatial cognition in future LALMs.