With the rapid development of video Multimodal Large Language Models (MLLMs), numerous benchmarks have been proposed to assess their video understanding capability. However, because the videos in these datasets contain few events, they may suffer from a shortcut bias: answers can be deduced from a handful of frames, without the need to watch the entire video. To address this issue, we introduce Event-Bench, an event-oriented long video understanding benchmark built on existing datasets and human annotations. Event-Bench includes six event-related tasks and 2,190 test instances to comprehensively evaluate video event understanding ability. Additionally, we propose Video Instruction Merging (VIM), a cost-effective method that enhances video MLLMs with merged, event-intensive video instructions, addressing the scarcity of human-annotated, event-intensive data. Extensive experiments show that the best-performing model, GPT-4o, achieves an overall accuracy of 53.33%, significantly outperforming the best open-source model by 41.42%. Leveraging an effective instruction synthesis method and an adaptive model architecture, VIM surpasses both state-of-the-art open-source models and GPT-4V on Event-Bench. All code, data, and models are publicly available at https://github.com/RUCAIBox/Event-Bench.
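As a rough illustration of the instruction-merging idea only (the released VIM pipeline may differ), the sketch below combines several short, single-event video instruction samples into one event-dense training instance; every name in it is a hypothetical placeholder, not code from the repository.

```python
# Hypothetical sketch of "instruction merging": concatenating several
# single-event video QA samples into one event-dense training instance.
# All names below are illustrative assumptions, not the released VIM code.

from dataclasses import dataclass
from typing import List

@dataclass
class VideoInstruction:
    frames: List[str]   # paths to sampled video frames
    question: str
    answer: str

def merge_instructions(samples: List[VideoInstruction]) -> VideoInstruction:
    """Merge short, single-event samples into one longer, event-intensive
    sample by concatenating their frames and chaining the QA pairs so that
    answering requires reasoning over every merged segment."""
    frames = [f for s in samples for f in s.frames]
    question = " ".join(
        f"(Event {i + 1}) {s.question}" for i, s in enumerate(samples)
    )
    answer = " ".join(
        f"(Event {i + 1}) {s.answer}" for i, s in enumerate(samples)
    )
    return VideoInstruction(frames=frames, question=question, answer=answer)
```

Chaining a QA pair per merged segment forces the model to attend to each clip in turn rather than answering from a single frame, which is exactly the shortcut bias the benchmark targets.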