With the rapid development of multimodal models, the demand for assessing video understanding capabilities has been steadily increasing. However, existing benchmarks for evaluating video understanding exhibit significant limitations in coverage, task diversity, and scene adaptability. These shortcomings hinder the accurate assessment of models' comprehensive video understanding capabilities. To tackle this challenge, we propose a hierarchical and holistic video understanding (H2VU) benchmark designed to evaluate both general video and online streaming video comprehension. This benchmark offers three key features: (1) Extended video duration: spanning videos from brief 3-second clips to 1.5-hour recordings, thereby bridging the temporal gaps found in current benchmarks. (2) Comprehensive assessment tasks: beyond traditional perception and reasoning tasks, we introduce modules for counter-commonsense comprehension and trajectory state tracking, which test models' deep understanding capabilities beyond mere prior knowledge. (3) Enriched video data: to keep pace with the rapid evolution of current AI agents, we expand first-person streaming video datasets, enabling exploration of multimodal models' ability to understand streaming videos from a first-person perspective. Extensive results on H2VU reveal that existing multimodal large language models (MLLMs) have substantial room for improvement on our newly proposed evaluation tasks. We expect that H2VU will facilitate advancements in video understanding research by offering a comprehensive and in-depth analysis of MLLMs.