Spatial reasoning in vision-language models (VLMs) remains fragile when semantics hinge on subtle temporal or geometric cues. We introduce a synthetic benchmark that probes two complementary skills: situational awareness (recognizing whether an interaction is harmful or benign) and spatial awareness (tracking who does what to whom, and reasoning about relative positions and motion). Through minimal video pairs, we test three challenges: distinguishing violence from benign activity, binding assailant roles across viewpoints, and judging fine-grained trajectory alignment. While we evaluate recent VLMs in a training-free setting, the benchmark is applicable to any video classification model. Results show performance only slightly above chance across all three tasks. A simple aid, stable color cues, partly reduces assailant role confusion but does not resolve the underlying weakness. By releasing data and code, we aim to provide reproducible diagnostics and seed exploration of lightweight spatial priors to complement large-scale pretraining.