The advent of large vision-language models (LVLMs) has spurred research into their applications in multi-modal contexts, particularly in video understanding. Traditional VideoQA benchmarks, despite providing quantitative metrics, often fail to encompass the full spectrum of video content and inadequately assess models' temporal comprehension. To address these limitations, we introduce MMBench-Video, a quantitative benchmark designed to rigorously evaluate LVLMs' proficiency in video understanding. MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases. The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy. We employ GPT-4 for automated assessment, demonstrating superior accuracy and robustness over earlier LLM-based evaluations. Utilizing MMBench-Video, we have conducted comprehensive evaluations of both proprietary and open-source LVLMs for images and videos. MMBench-Video stands as a valuable resource for the research community, facilitating improved evaluation of LVLMs and catalyzing progress in the field of video understanding. The evaluation code of MMBench-Video will be integrated into VLMEvalKit: https://github.com/open-compass/VLMEvalKit.