With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in static image tasks, while overlooking temporal understanding in dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporally related tasks. By transforming various static tasks into dynamic ones, we enable the systematic generation of video tasks that require a broad spectrum of temporal skills, ranging from perception to cognition. Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task. On the one hand, this paradigm allows us to build MVBench efficiently, without much manual intervention. On the other hand, it guarantees evaluation fairness with ground-truth video annotations, avoiding the biased scoring of LLMs. Moreover, we develop a robust video MLLM baseline, i.e., VideoChat2, by progressive multi-modal training with diverse instruction-tuning data. Extensive results on MVBench reveal that existing MLLMs are far from satisfactory in temporal understanding, while our VideoChat2 surpasses these leading models by over 15% on MVBench. All models and data are available at https://github.com/OpenGVLab/Ask-Anything.
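To make the annotation-to-QA idea concrete, the following is a minimal sketch of how a ground-truth video label could be turned into a multiple-choice item and scored by exact option matching, so that no LLM judge is involved. It is an illustration under stated assumptions, not the MVBench pipeline: the names `MultipleChoiceItem`, `build_item`, `format_prompt`, and `accuracy`, the four-option format, and the random sampling of distractors from a label pool are all hypothetical choices made for this example.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class MultipleChoiceItem:
    """One multiple-choice question derived from a video annotation."""
    question: str
    options: List[str]
    answer_index: int  # position of the ground-truth option in `options`


def build_item(question, ground_truth, label_pool, num_options=4, seed=0):
    """Hypothetical helper: wrap an annotated label into a multiple-choice item.

    Distractors are sampled from the other labels in `label_pool`; this
    sampling strategy is an assumption made for illustration only.
    """
    rng = random.Random(seed)
    distractor_pool = sorted(set(label_pool) - {ground_truth})
    if len(distractor_pool) < num_options - 1:
        raise ValueError("not enough distinct labels to sample distractors")
    options = rng.sample(distractor_pool, num_options - 1) + [ground_truth]
    rng.shuffle(options)
    return MultipleChoiceItem(question, options, options.index(ground_truth))


def format_prompt(item):
    """Render the item as text with lettered options: (A), (B), ..."""
    lines = [item.question]
    for i, option in enumerate(item.options):
        lines.append(f"({chr(ord('A') + i)}) {option}")
    return "\n".join(lines)


def accuracy(items, predicted_letters):
    """Fraction of items whose predicted letter matches the ground truth."""
    if not items or len(items) != len(predicted_letters):
        raise ValueError("items and predictions must be non-empty and aligned")
    correct = 0
    for item, letter in zip(items, predicted_letters):
        expected = chr(ord("A") + item.answer_index)
        if letter.strip().upper() == expected:
            correct += 1
    return correct / len(items)
```

Because the correct option comes directly from the annotation, scoring reduces to comparing a predicted letter with the stored answer index, which is what makes this style of evaluation independent of a scoring model.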