The ability to perceive how objects change over time is a crucial ingredient in human intelligence. However, current benchmarks cannot faithfully reflect the temporal understanding abilities of video-language models (VidLMs) due to the existence of static visual shortcuts. To remedy this issue, we present VITATECS, a diagnostic VIdeo-Text dAtaset for the evaluation of TEmporal Concept underStanding. Specifically, we first introduce a fine-grained taxonomy of temporal concepts in natural language in order to diagnose the capability of VidLMs to comprehend different temporal aspects. Furthermore, to disentangle the correlation between static and temporal information, we generate counterfactual video descriptions that differ from the original one only in the specified temporal aspect. We employ a semi-automatic data collection framework using large language models and human-in-the-loop annotation to obtain high-quality counterfactual descriptions efficiently. Evaluation of representative video-language understanding models confirms their deficiency in temporal understanding, revealing the need for greater emphasis on the temporal elements in video-language research.
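As an illustration of how counterfactual descriptions can be used diagnostically, the sketch below computes, per temporal aspect, how often a model scores the original description above its counterfactual. This is a minimal sketch under stated assumptions, not the evaluation protocol of the paper: the function name `pairwise_accuracy`, the aspect labels, and the similarity scores are all hypothetical placeholders, and treating a tie as a miss is a design choice made here.

```python
from collections import defaultdict


def pairwise_accuracy(records):
    """Fraction of pairs, per aspect, where the original caption outscores
    its counterfactual.

    `records` is an iterable of (aspect, score_original, score_counterfactual)
    tuples. The scores stand in for any video-text similarity a model produces
    (assumption: higher means a better match).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for aspect, score_original, score_counterfactual in records:
        totals[aspect] += 1
        # A tie counts as a miss: the model failed to prefer the original.
        if score_original > score_counterfactual:
            hits[aspect] += 1
    return {aspect: hits[aspect] / totals[aspect] for aspect in totals}


# Made-up scores and placeholder aspect labels, for illustration only.
records = [
    ("direction", 0.81, 0.64),
    ("direction", 0.40, 0.55),
    ("sequence", 0.72, 0.70),
    ("sequence", 0.33, 0.33),
]
accuracy = pairwise_accuracy(records)
print(accuracy)
```

Because each counterfactual differs from the original only in one temporal aspect, a model that relies on static cues alone would be expected to score near chance (0.5) on this kind of pairwise comparison.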