Fine-grained spatio-temporal understanding is essential for video reasoning and embodied AI. Yet, while Multimodal Large Language Models (MLLMs) master static semantics, their grasp of temporal dynamics remains brittle. We present TimeBlind, a diagnostic benchmark for compositional spatio-temporal understanding. Inspired by cognitive science, TimeBlind categorizes fine-grained temporal understanding into three levels: recognizing atomic events, characterizing event properties, and reasoning about event interdependencies. Unlike benchmarks that conflate recognition with temporal reasoning, TimeBlind leverages a minimal-pairs paradigm: video pairs share identical static visual content but differ solely in temporal structure, with complementary questions that neutralize language priors. Evaluating over 20 state-of-the-art MLLMs (e.g., GPT-5, Gemini 3 Pro) on 600 curated instances (2,400 video-question pairs) reveals that the Instance Accuracy (correctly answering on both videos of a pair) of the best-performing MLLM is only 48.2%, far below human performance (98.2%). These results demonstrate that even frontier models rely heavily on static visual shortcuts rather than genuine temporal logic, positioning TimeBlind as a vital diagnostic tool for next-generation video understanding. Dataset and code are available at https://baiqi-li.github.io/timeblind_project/ .
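For concreteness, the pair-level metric can be sketched as below. This is a minimal illustration, not the released evaluation code: the field names (`video_a_correct`, `video_b_correct`) are hypothetical assumptions about how per-video results might be recorded.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InstanceResult:
    """One TimeBlind instance: a pair of videos with identical static
    content but different temporal structure, each queried with a
    complementary question. Field names here are illustrative."""
    video_a_correct: bool  # model answered correctly on video A
    video_b_correct: bool  # model answered correctly on its temporal counterpart B


def instance_accuracy(results: List[InstanceResult]) -> float:
    """Instance Accuracy: fraction of pairs where BOTH videos are answered
    correctly. Because the two videos look identical frame-by-frame, a model
    that relies on static appearance gives the same answer to both
    complementary questions and can get at most one of them right."""
    if not results:
        return 0.0
    both = sum(r.video_a_correct and r.video_b_correct for r in results)
    return both / len(results)
```

Under this scoring, a static-shortcut model is driven toward zero Instance Accuracy even if its per-video accuracy looks respectable, which is what makes the minimal-pairs design diagnostic.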