Existing benchmarks often highlight the remarkable performance that state-of-the-art Multimodal Foundation Models (MFMs) achieve by leveraging temporal context for video understanding. However, how well do these models truly perform visual temporal reasoning? Our study of existing benchmarks shows that this capability of MFMs is likely overestimated, since many questions can be solved with a single frame, a few frames, or out-of-order frames. To systematically examine current visual temporal reasoning tasks, we propose three principles with corresponding metrics: (1) Multi-Frame Gain, (2) Frame Order Sensitivity, and (3) Frame Information Disparity. Following these principles, we introduce TOMATO (Temporal Reasoning Multimodal Evaluation), a novel benchmark crafted to rigorously assess MFMs' temporal reasoning capabilities in video understanding. TOMATO comprises 1,484 carefully curated, human-annotated questions spanning six tasks (action count, direction, rotation, shape & trend, velocity & frequency, and visual cues), applied to 1,417 videos, including 805 self-recorded and self-generated videos, that encompass human-centric, real-world, and simulated scenarios. Our comprehensive evaluation reveals a 57.3% human-model performance gap with the best-performing model. Moreover, our in-depth analysis uncovers more fundamental limitations in current MFMs beyond this gap: although they can accurately recognize events in isolated frames, they fail to interpret those frames as a continuous sequence. We believe TOMATO will serve as a crucial testbed for evaluating next-generation MFMs and as a call to the community to develop AI systems capable of comprehending the dynamics of the human world through the video modality.