Understanding a video requires more than recognizing isolated moments: humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet it remains underexplored in current evaluations of Multimodal Large Language Models (MLLMs). We introduce the Visual STAte Tracking benchmark (VSTAT), a video-based benchmark designed to diagnose visual state tracking in MLLMs. VSTAT consists of 834 clips drawn from both synthetic and real-world videos, paired with 1,500 questions that cannot be answered from any single frame or short segment and instead require continuous perception and integration of events across the entire video stream. Although state-of-the-art MLLMs perform strongly on existing video benchmarks, we find that on VSTAT they perform far below humans and only modestly above answer-prior baselines. To analyze this gap, we compare MLLMs' thinking traces with the underlying video stream to understand why and when MLLMs fail on VSTAT. We find that MLLMs reason and track correctly in text, but fail at visually perceiving the events they need to track. Finally, our preliminary evaluation suggests that recent agentic approaches, including MLLM-based video agents and coding agents, do not readily resolve these failures and still fall short on VSTAT.