A core aspect of human perception is situated awareness, the ability to relate ourselves to the surrounding physical environment and reason over possible actions in context. However, most existing benchmarks for multimodal foundation models (MFMs) emphasize environment-centric spatial relations (relations among objects in a scene), while largely overlooking observer-centric relationships that require reasoning relative to the agent's viewpoint, pose, and motion. To bridge this gap, we introduce SAW-Bench (Situated Awareness in the Real World), a novel benchmark for evaluating egocentric situated awareness using real-world videos. SAW-Bench comprises 786 self-recorded videos captured with Ray-Ban Meta (Gen 2) smart glasses spanning diverse indoor and outdoor environments, together with 2,071 human-annotated question-answer pairs. It probes a model's observer-centric understanding through six distinct awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even for the best-performing MFM, Gemini 3 Flash. Beyond this gap, our in-depth analysis uncovers several notable findings; for example, while models can exploit partial geometric cues in egocentric videos, they often fail to infer a coherent camera geometry, leading to systematic spatial reasoning errors. We position SAW-Bench as a benchmark for situated spatial intelligence, moving beyond passive observation toward understanding physically grounded, observer-centric dynamics.