TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, Jianwei Yang

From arXiv. Project page: https://temporalbench.github.io/

Understanding fine-grained temporal dynamics is crucial for multimodal video comprehension and generation. Because fine-grained temporal annotations are scarce, existing video benchmarks mostly resemble static image benchmarks and are ill-suited to evaluating models' temporal understanding. In this paper, we introduce TemporalBench, a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. TemporalBench consists of ~10K video question-answer pairs, derived from ~2K high-quality human annotations detailing the temporal dynamics in video clips. As a result, our benchmark provides a unique testbed for evaluating various temporal understanding and reasoning abilities, such as action frequency, motion magnitude, and event order. Moreover, it supports evaluation across tasks (video question answering and captioning), video lengths (short and long video understanding), and model types (multimodal video embedding models and text generation models). Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench, demonstrating a significant gap (~30%) between humans and AI in temporal understanding. Furthermore, we identify a critical pitfall in multi-choice QA: LLMs can detect the subtle changes in negative captions and use a centralized description as a cue for their predictions. To correct this bias, we propose Multiple Binary Accuracy (MBA). We hope that TemporalBench can foster research on improving models' temporal reasoning capabilities. Both the dataset and the evaluation code will be made available.
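The abstract names Multiple Binary Accuracy (MBA) without defining it. The sketch below shows one plausible reading, stated here as an assumption rather than the authors' implementation: each multi-choice item is split into several binary (positive vs. one negative caption) questions, and the item counts as correct only if every one of its binary questions is answered correctly. The function names, the `(group_id, is_correct)` input format, and the sample data are all hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def binary_accuracy(results: Iterable[Tuple[str, bool]]) -> float:
    """Plain per-question accuracy over all binary questions."""
    outcomes = [is_correct for _, is_correct in results]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def multiple_binary_accuracy(results: Iterable[Tuple[str, bool]]) -> float:
    """Assumed MBA: a group scores 1 only if ALL of its binary questions are correct.

    `results` holds one (group_id, is_correct) pair per binary question, where
    group_id identifies the video/positive caption the question belongs to.
    """
    groups = defaultdict(list)
    for group_id, is_correct in results:
        groups[group_id].append(is_correct)
    if not groups:
        return 0.0
    return sum(all(outcomes) for outcomes in groups.values()) / len(groups)


# Hypothetical outcomes: clip "v1" has three binary questions, "v2" has two.
sample = [
    ("v1", True), ("v1", True), ("v1", False),
    ("v2", True), ("v2", True),
]
print(binary_accuracy(sample))           # 4 of 5 binary questions correct
print(multiple_binary_accuracy(sample))  # only "v2" is fully correct
```

Under this reading, a model cannot score well by spotting which option sits "in the middle" of the negatives, because each negative is compared against the positive caption on its own, and a single miss zeroes out the whole group.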

