Existing benchmarks for Vision-Language Models (VLMs) primarily evaluate spatio-temporal understanding on simple single-action videos, closed attribute sets, and restricted entity types, and thus fail to capture the freeform, multi-action interactions between diverse entities that characterize real-world video understanding. Furthermore, the lack of a systematic framework for analyzing model failures across complementary spatio-temporal axes hinders comprehensive evaluation. To address these gaps, we introduce VISTA, a Video Interaction Spatio-Temporal Analysis benchmark designed for open-set, multi-entity, and multi-action spatio-temporal understanding in VLMs. VISTA decomposes videos into interpretable entities, their associated actions, and relational dynamics, enabling multi-axis diagnostics and unified assessment of relational, spatial, and temporal understanding. Our benchmark integrates multiple datasets into a single interaction-aware taxonomy and comprises roughly 12K curated video-query pairs spanning diverse scenes and complexities. We systematically evaluate 11 state-of-the-art VLMs on VISTA and break down aggregate performance across our taxonomy to reveal shortcomings and pronounced spatio-temporal biases that traditional metrics obscure. By providing detailed, taxonomy-driven diagnostics on a challenging dataset, VISTA offers a nuanced framework to guide advances in model design, pretraining strategies, and evaluation protocols. Overall, VISTA is the first large-scale, interaction-aware diagnostic benchmark for spatio-temporal understanding in VLMs.