Understanding fine-grained temporal dynamics is crucial for multimodal video comprehension and generation. Because fine-grained temporal annotations are scarce, existing video benchmarks mostly resemble static image benchmarks and are inadequate for evaluating temporal understanding. In this paper, we introduce TemporalBench, a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. TemporalBench consists of ~10K video question-answer pairs derived from ~2K high-quality human annotations that detail the temporal dynamics in video clips. As a result, our benchmark provides a unique testbed for evaluating temporal understanding and reasoning abilities such as action frequency, motion magnitude, and event order. Moreover, it supports evaluation across tasks (video question answering and captioning), video lengths (short and long video understanding), and model types (multimodal video embedding models and text generation models). Results show that state-of-the-art models such as GPT-4o achieve only 38.5% question answering accuracy on TemporalBench, demonstrating a significant gap (~30%) between humans and AI in temporal understanding. Furthermore, we identify a critical pitfall in multi-choice QA: LLMs can detect the subtle changes in negative captions, find a centralized description, and use it as a cue for their predictions. To correct this bias, we propose Multiple Binary Accuracy (MBA). We hope that TemporalBench can foster research on improving models' temporal reasoning capabilities. Both the dataset and the evaluation code will be made available.
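The abstract names Multiple Binary Accuracy (MBA) without defining it. The following is a minimal sketch of one plausible reading, under the assumption that each multi-choice question is split into several binary questions (the positive caption against one negative caption at a time) and that a question counts as correct only if every one of its binary questions is answered correctly. The function names and the input layout are hypothetical and are not taken from the TemporalBench evaluation code.

```python
from typing import Sequence


def multiple_binary_accuracy(binary_results: Sequence[Sequence[bool]]) -> float:
    """Fraction of questions whose binary sub-questions are ALL correct.

    Assumption (not confirmed by the abstract): binary_results[i][j] is True
    when the model picked the positive caption over the j-th negative caption
    of question i.
    """
    if not binary_results:
        raise ValueError("binary_results must contain at least one question")
    # A question with no binary sub-questions is treated as incorrect.
    fully_correct = sum(1 for r in binary_results if len(r) > 0 and all(r))
    return fully_correct / len(binary_results)


def binary_accuracy(binary_results: Sequence[Sequence[bool]]) -> float:
    """Plain per-pair accuracy over all binary sub-questions, for comparison."""
    flat = [ok for r in binary_results for ok in r]
    if not flat:
        raise ValueError("binary_results must contain at least one binary result")
    return sum(flat) / len(flat)


# Hypothetical outcomes for three questions with 3, 3, and 2 negative captions.
example = [
    [True, True, True],   # every binary comparison correct
    [True, False, True],  # one miss, so the whole question counts as wrong
    [True, True],         # every binary comparison correct
]
mba = multiple_binary_accuracy(example)
ba = binary_accuracy(example)
```

Under this reading, a model cannot score by spotting which option the negatives were derived from, because each comparison shows only one negative at a time, and a single miss fails the whole question. In the hypothetical example, seven of the eight binary comparisons are correct, but only two of the three questions pass.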