With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in static image tasks and overlook temporal understanding in dynamic video tasks. To address this gap, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporal tasks. By transforming various static tasks into dynamic ones, we enable the systematic generation of video tasks that require a broad spectrum of temporal skills, ranging from perception to cognition. Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task. On the one hand, this paradigm allows us to build MVBench efficiently, without much manual intervention. On the other hand, it guarantees evaluation fairness by scoring against ground-truth video annotations, avoiding the biased scoring of LLMs. We further develop a robust video MLLM baseline, VideoChat2, through progressive multi-modal training with diverse instruction-tuning data. Extensive results on MVBench reveal that existing MLLMs are far from satisfactory in temporal understanding, while VideoChat2 surpasses these leading models by over 15% on MVBench. All models and data are available at https://github.com/OpenGVLab/Ask-Anything.
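The conversion of ground-truth video annotations into multiple-choice QA can be illustrated with a minimal sketch. This is not the MVBench pipeline itself: the `MultipleChoiceQA` type, the `annotation_to_qa` and `accuracy` helpers, the question template, and the choice to draw distractors from other labels in the same pool are all assumptions made for illustration. The sketch only shows why such an evaluation can be scored directly against annotations, without an LLM acting as judge.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class MultipleChoiceQA:
    """Hypothetical container for one multiple-choice item."""
    video_id: str
    question: str
    options: list[str]
    answer_index: int  # position of the ground-truth label in `options`


def annotation_to_qa(video_id, label, label_pool, question, num_options=4, seed=0):
    """Build one multiple-choice item from a ground-truth annotation.

    Assumption: distractors are sampled from the other labels in `label_pool`.
    """
    rng = random.Random(seed)  # seeded so the generated item is reproducible
    candidates = sorted(set(label_pool) - {label})
    if len(candidates) < num_options - 1:
        raise ValueError("not enough distinct labels to build distractors")
    options = rng.sample(candidates, num_options - 1) + [label]
    rng.shuffle(options)  # avoid always placing the answer last
    return MultipleChoiceQA(video_id, question, options, options.index(label))


def accuracy(items, predictions):
    """Fraction of items whose predicted option index matches the annotation."""
    if not items:
        return 0.0
    correct = sum(
        1 for item, pred in zip(items, predictions) if pred == item.answer_index
    )
    return correct / len(items)


if __name__ == "__main__":
    # Illustrative labels only; not taken from any real dataset.
    pool = ["opening a door", "closing a door", "pouring water",
            "folding paper", "throwing a ball"]
    item = annotation_to_qa("video_0001", "pouring water", pool,
                            "What action is performed in the video?")
    for letter, option in zip("ABCD", item.options):
        print(f"({letter}) {option}")
    print("ground truth:", "ABCD"[item.answer_index])
```

Because the correct option comes straight from the annotation, scoring reduces to an exact index match, which is the property the abstract relies on when it contrasts this paradigm with LLM-based scoring.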