Video understanding is a crucial next step for multimodal large language models (MLLMs), and various benchmarks have been introduced to evaluate them. Nevertheless, current video benchmarks remain inefficient for evaluating video models during iterative development, owing to the high cost of constructing datasets and the difficulty of isolating specific skills. In this paper, we propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework based on synthetic video generation. VideoNIAH decouples video content from query-response pairs by inserting unrelated visual 'needles' into original videos. The framework automates the generation of query-response pairs using predefined rules, minimizing manual labor. The queries focus on specific aspects of video understanding, enabling more skill-specific evaluations. The separation between video content and queries also allows for greater video variety and evaluation across different video lengths. Using VideoNIAH, we compile a video benchmark, VNBench, which includes retrieval, ordering, and counting tasks to evaluate three key aspects of video understanding: temporal perception, chronological ordering, and spatio-temporal coherence. We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities across tasks. Additionally, we perform an in-depth analysis of the test results and model configurations. Based on these findings, we offer advice for improving video MLLM training, providing valuable insights to guide future research and model development. The code and data are available at https://github.com/joez17/VideoNIAH.
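The needle-insertion idea above can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration (the function names, needle labels, and data layout are assumptions, not the paper's actual implementation): each needle is assigned a random insertion timestamp, and retrieval, ordering, and counting query-response pairs are then derived from that metadata by fixed rules, with no manual annotation.

```python
import random

def insert_needles(video_len_s, needle_ids, rng):
    """Assign each needle a distinct random timestamp (hypothetical sketch)."""
    times = sorted(rng.sample(range(video_len_s), len(needle_ids)))
    return list(zip(needle_ids, times))  # [(needle, t_seconds), ...]

def make_queries(insertions):
    """Derive retrieval / ordering / counting QA pairs by predefined rules."""
    needles = [n for n, _ in insertions]
    qa = []
    # Retrieval: the needle was inserted, so the answer is known to be "yes".
    qa.append((f"Does '{needles[0]}' appear in the video?", "yes"))
    # Ordering: ground truth is the needles sorted by insertion time.
    ordered = [n for n, t in sorted(insertions, key=lambda x: x[1])]
    qa.append(("In what order do the inserted items appear?", ordered))
    # Counting: ground truth is the number of inserted needles.
    qa.append(("How many inserted items appear in the video?", len(needles)))
    return qa

rng = random.Random(0)
insertions = insert_needles(60, ["red apple", "blue cube", "clock"], rng)
for query, answer in make_queries(insertions):
    print(query, "->", answer)
```

Because the ground truth comes entirely from the insertion metadata, the same rules apply to any host video of any length, which is what makes the construction cheap to scale.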