Video-based large language models (Video-LLMs) have recently been introduced, aiming both to improve fundamental perception and comprehension and to handle a diverse range of user inquiries. In pursuit of the ultimate goal of artificial general intelligence, a truly intelligent Video-LLM should not only see and understand its surroundings, but also possess human-level commonsense and make well-informed decisions for its users. To guide the development of such models, a robust and comprehensive evaluation system is crucial. To this end, this paper proposes \textit{Video-Bench}, a new comprehensive benchmark, along with a toolkit, specifically designed for evaluating Video-LLMs. The benchmark comprises 10 meticulously crafted tasks that evaluate the capabilities of Video-LLMs at three distinct levels: Video-exclusive Understanding, Prior Knowledge-based Question-Answering, and Comprehension and Decision-making. In addition, we introduce an automatic toolkit that processes model outputs for the various tasks, computes the metrics, and generates final scores. We evaluate 8 representative Video-LLMs using \textit{Video-Bench}. The findings reveal that current Video-LLMs still fall considerably short of human-like comprehension and analysis of real-world videos, offering valuable insights for future research directions. The benchmark and toolkit are available at: \url{https://github.com/PKU-YuanGroup/Video-Bench}.