V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction

Large Vision-Language Models (LVLMs) have recently made significant strides in video understanding. Nevertheless, existing video benchmarks rely predominantly on text prompts for evaluation, which often require complex referential language and in turn reduce both the accuracy and the efficiency of human-model interaction. To address this limitation, we propose V2P-Bench, a robust and comprehensive benchmark for evaluating the ability of LVLMs to understand Video Visual Prompts in human-model interaction scenarios. V2P-Bench consists of 980 videos and 1172 well-structured, high-quality QA pairs, each paired with manually annotated visual prompt frames. The benchmark spans three main tasks and twelve categories, enabling fine-grained, instance-level evaluation. Through an in-depth analysis of current LVLMs, we identify several key findings: 1) Visual prompts are both more model-friendly and more user-friendly than text prompts in interactive scenarios, leading to significantly improved model performance and an enhanced user experience. 2) Models are reasonably capable of zero-shot understanding of visual prompts but struggle with spatiotemporal understanding. Even o1 achieves only 71.8%, far below the human expert score of 88.3%, while most open-source models score below 60%. 3) LVLMs exhibit pervasive Hack Phenomena in video question answering tasks, which become more pronounced as video length increases and frame sampling density decreases, thereby artificially inflating performance scores. We anticipate that V2P-Bench will not only shed light on these challenges but also serve as a foundational tool for advancing human-model interaction and improving the evaluation of video understanding.
