STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence

Despite rapid progress in Multimodal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence, defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting, which includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Our data curation pipeline uses two methods to ensure high-quality samples. For foundational tasks, we use procedurally synthesized and physics-simulated audio. For holistic data, we follow a four-stage process that includes human annotation and final selection based on human performance. Unlike prior benchmarks, where answering from captions alone reduces accuracy only slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), evidence that it targets cues that are hard to describe linguistically. Evaluating 19 models reveals substantial gaps compared with humans and a capability hierarchy: closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning. STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.
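To make the segment reordering task more concrete, the following is a minimal sketch of how such an item could be built and scored. It is not the benchmark's actual pipeline, which the abstract does not specify: the equal-length cuts, the seeded shuffle, the exact-match metric, and all function names (`make_reordering_item`, `restore_order`, `exact_match_accuracy`) are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def make_reordering_item(
    samples: Sequence[float], n_segments: int, seed: int
) -> Tuple[List[List[float]], List[int]]:
    """Cut a clip into equal-length segments and shuffle them (hypothetical helper).

    Returns the shuffled segments and, for each shuffled position, the index
    of the segment in the original clip. Trailing samples that do not fill a
    whole segment are dropped.
    """
    seg_len = len(samples) // n_segments
    segments = [
        list(samples[i * seg_len:(i + 1) * seg_len]) for i in range(n_segments)
    ]
    order = list(range(n_segments))
    random.Random(seed).shuffle(order)  # seeded so the item is reproducible
    shuffled = [segments[i] for i in order]
    return shuffled, order


def restore_order(
    shuffled: List[List[float]], order: Sequence[int]
) -> List[List[float]]:
    """Put shuffled segments back in place, given each one's original index."""
    restored: List[List[float]] = [[] for _ in order]
    for position, original_index in enumerate(order):
        restored[original_index] = shuffled[position]
    return restored


def exact_match_accuracy(
    predictions: Sequence[Sequence[int]], references: Sequence[Sequence[int]]
) -> float:
    """Fraction of items whose predicted order matches the reference exactly.

    Exact match is an assumed metric, chosen here for simplicity.
    """
    if not references:
        return 0.0
    correct = sum(
        list(pred) == list(ref) for pred, ref in zip(predictions, references)
    )
    return correct / len(references)
```

In this sketch a model would receive the shuffled segments and predict, for each one, its position in the original clip; a prediction counts as correct only if the whole order matches.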

