Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and, crucially, under-evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT), i.e., whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best lag far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.