Fine-grained spatio-temporal understanding is essential for video reasoning and embodied AI. Yet, while Multimodal Large Language Models (MLLMs) master static semantics, their grasp of temporal dynamics remains brittle. We present TimeBlind, a diagnostic benchmark for compositional spatio-temporal understanding. Inspired by cognitive science, TimeBlind categorizes fine-grained temporal understanding into three levels: recognizing atomic events, characterizing event properties, and reasoning about event interdependencies. Unlike benchmarks that conflate recognition with temporal reasoning, TimeBlind leverages a minimal-pairs paradigm: video pairs share identical static visual content but differ solely in temporal structure, and complementary questions neutralize language priors. Evaluating over 20 state-of-the-art MLLMs (e.g., GPT-5, Gemini 3 Pro) on 600 curated instances (2,400 video-question pairs) reveals that the Instance Accuracy (correctly distinguishing both videos in a pair) of the best-performing MLLM is only 48.2%, far below human performance (98.2%). These results demonstrate that even frontier models rely heavily on static visual shortcuts rather than genuine temporal logic, positioning TimeBlind as a vital diagnostic tool for next-generation video understanding. Dataset and code are available at https://baiqi-li.github.io/timeblind_project/.
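For concreteness, the pair-level metric can be sketched as follows. This is a minimal illustration of Instance Accuracy under the minimal-pairs paradigm as described above, not the released evaluation code; the `Instance` layout, its field names, and `instance_accuracy` are all hypothetical.

```python
# Minimal sketch of Instance Accuracy (hypothetical data layout, not the
# official TimeBlind evaluation code). Each instance holds two videos with
# identical static content but different temporal structure; the model must
# answer correctly on BOTH videos for the instance to count as correct.

from dataclasses import dataclass


@dataclass
class Instance:
    question: str  # complementary question posed for both videos
    answer_a: str  # ground-truth answer for video A
    answer_b: str  # ground-truth answer for video B (differs from A)
    pred_a: str    # model prediction on video A
    pred_b: str    # model prediction on video B


def instance_accuracy(instances: list[Instance]) -> float:
    """Fraction of pairs where the model distinguishes both videos."""
    correct = sum(
        inst.pred_a == inst.answer_a and inst.pred_b == inst.answer_b
        for inst in instances
    )
    return correct / len(instances)


# Example: one pair answered fully correctly, one only half correctly.
pairs = [
    Instance("Which happens first?", "pour", "stir", "pour", "stir"),
    Instance("Which happens first?", "open", "close", "open", "open"),
]
print(instance_accuracy(pairs))  # 0.5 -- half-correct pairs score zero
```

A model that keys only on static content tends to give the same answer for both videos in a pair, getting at most one of the two right and scoring zero on that instance; this is what makes Instance Accuracy stricter than per-video accuracy.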