Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in vision-language understanding. Yet human perception is inherently multisensory, integrating sight, sound, and motion to reason about the world. Among these modalities, sound provides indispensable cues about spatial layout, off-screen events, and causal interactions, particularly in egocentric settings where auditory and visual signals are tightly coupled. To address this gap, we introduce EgoSound, the first benchmark designed to systematically evaluate egocentric sound understanding in MLLMs. EgoSound unifies data from Ego4D and EgoBlind, encompassing both sighted and sound-dependent experiences. It defines a seven-task taxonomy spanning intrinsic sound perception, spatial localization, causal inference, and cross-modal reasoning. Constructed through a multi-stage automatic generation pipeline, EgoSound contains 7,315 validated QA pairs across 900 videos. Comprehensive experiments on nine state-of-the-art MLLMs reveal that current models exhibit emerging auditory reasoning abilities but remain limited in fine-grained spatial and causal understanding. EgoSound establishes a challenging foundation for advancing multisensory egocentric intelligence, bridging the gap between seeing and truly hearing the world.