There has been growing sentiment recently that modern large multimodal models (LMMs) have addressed most of the key challenges related to short video comprehension. As a result, both academia and industry are gradually shifting their attention toward the more complex challenges posed by understanding long-form videos. However, is this really the case? Our studies indicate that LMMs still lack many fundamental reasoning capabilities even when dealing with short videos. We introduce Vinoground, a temporal counterfactual LMM evaluation benchmark encompassing 1000 short and natural video-caption pairs. We demonstrate that existing LMMs severely struggle to distinguish temporal differences between different actions and between object transformations. For example, the best-performing model, GPT-4o, obtains only ~50% on our text and video scores, a large gap from the human baseline of ~90%. All open-source multimodal models and CLIP-based models perform much worse, mostly at random-chance level. Through this work, we shed light on the fact that temporal reasoning in short videos is a problem yet to be fully solved. The dataset and evaluation code are available at https://vinoground.github.io.