Physical video understanding requires more than naming an event correctly. A model can answer a question about pouring, sliding, or collision from textual regularities alone while still failing to localize the event in time or space. We introduce a grounded benchmark for physical video understanding that extends the what--when--where evaluation structure of V-STaR to four video sources, six physics domains, three prompt families (physics, vstar_like, and neutral_rstr), and four input conditions (original, shuffled, ablated, and frame-masked). The benchmark contains 1,560 base video clips drawn from SSV2, YouCook2, HoloAssist, and Roundabout-TAU. Each clip is first converted into a shared grounded event record, and the queries for all three prompt families are derived from that record. Temporal and spatial targets are shared across prompt families, while the two non-physics families use deterministic, family-appropriate semantic a_what targets derived from the same record. Across models and prompt families, physics remains the strongest regime overall, vstar_like is the clearest non-physics semantic comparison, and neutral_rstr behaves as a harder templated control. Prompt-family robustness is selective rather than universal, perturbation gains cluster in cases where performance on the original input is weak, and spatial grounding is the weakest component across settings. These results suggest that video QA reasoning benchmarks should report physically grounded, prompt-aware, and perturbation-aware diagnostics alongside aggregate accuracy.
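The record-then-derive design described above can be illustrated with a minimal Python sketch. It is not the benchmark's actual schema: the class names, field names, question templates, the "fluids" domain label, and the example values are all hypothetical, chosen only to show how three prompt families can share temporal and spatial targets while differing in their a_what target.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) in normalized image coordinates.
Box = Tuple[float, float, float, float]

# Prompt-family names as given in the text.
FAMILIES = ("physics", "vstar_like", "neutral_rstr")


@dataclass(frozen=True)
class GroundedEvent:
    """Hypothetical shared record for one clip (field names are assumptions)."""
    clip_id: str
    source: str                 # e.g. "SSV2"
    domain: str                 # physics domain label (illustrative)
    physics_label: str          # physics-level description of the event
    semantic_label: str         # plain semantic description of the event
    span: Tuple[float, float]   # event start and end time, in seconds
    box: Box                    # spatial extent of the event


@dataclass(frozen=True)
class Query:
    family: str
    question: str
    a_what: str
    a_when: Tuple[float, float]
    a_where: Box


def derive_queries(ev: GroundedEvent) -> List[Query]:
    """Derive one query per prompt family from a single shared record."""
    # Illustrative templates only; the real prompts are not given in the text.
    specs = {
        "physics": ("Which physical event occurs in the clip?", ev.physics_label),
        "vstar_like": ("What is happening in the clip?", ev.semantic_label),
        "neutral_rstr": ("Describe the marked event.", ev.semantic_label),
    }
    # a_when and a_where are copied from the record, so every family shares them.
    return [Query(f, specs[f][0], specs[f][1], ev.span, ev.box) for f in FAMILIES]


# Made-up example record, for illustration only.
event = GroundedEvent(
    clip_id="clip_0001",
    source="SSV2",
    domain="fluids",
    physics_label="liquid pours from one container into another",
    semantic_label="pouring water into a cup",
    span=(1.2, 3.8),
    box=(0.30, 0.25, 0.70, 0.90),
)
queries = derive_queries(event)
```

Because every query copies its a_when and a_where from the same record, any difference in temporal or spatial scores between prompt families can be attributed to the prompt rather than to a change in the target.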