Despite rapid progress in Multimodal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, which masks deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence, defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting. The latter includes segment reordering for continuous and discrete processes, and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Our data curation pipeline uses two methods to ensure high-quality samples: for foundational tasks, we use procedurally synthesized and physics-simulated audio; for holistic data, we follow a four-stage process that includes human annotation and final selection based on human performance. Whereas caption-only answering reduces accuracy only slightly on prior benchmarks, on STAR-Bench it induces far larger drops (-31.5% temporal, -35.2% spatial), evidence that the benchmark targets cues that are hard to describe linguistically. Evaluating 19 models reveals substantial gaps compared with humans and a capability hierarchy: closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning. STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.
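To make the caption-only comparison concrete, the following is a minimal sketch of how such an accuracy drop could be computed. It assumes the reported drop is the difference, in percentage points, between accuracy when a model answers from a text caption and accuracy when it answers from the audio itself; the abstract does not spell out the exact formula. The function names and the toy predictions are hypothetical and are not taken from STAR-Bench.

```python
def accuracy(predictions, answers):
    """Return the percentage of items whose prediction matches the answer."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def caption_only_drop(audio_preds, caption_preds, answers):
    """Accuracy change (percentage points) when audio input is replaced by a caption.

    Assumption: drop = caption-only accuracy - audio-input accuracy,
    so a negative value means the caption alone loses information.
    """
    return accuracy(caption_preds, answers) - accuracy(audio_preds, answers)


# Toy multiple-choice data, invented for illustration only.
answers = ["A", "B", "C", "D"]
audio_preds = ["A", "B", "C", "A"]    # 3 of 4 correct with audio input
caption_preds = ["A", "C", "D", "A"]  # 1 of 4 correct with caption only

drop = caption_only_drop(audio_preds, caption_preds, answers)
print(f"caption-only drop: {drop:+.1f} points")
```

Under this reading, a benchmark whose questions hinge on cues that captions cannot carry should show a strongly negative value, while a benchmark answerable from captions should stay near zero.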