Video-based large language models (Video-LLMs) have recently been introduced, targeting both fundamental improvements in perception and comprehension and the handling of a diverse range of user inquiries. In pursuit of the ultimate goal of artificial general intelligence, a truly intelligent Video-LLM should not only see and understand its surroundings, but also possess human-level commonsense and make well-informed decisions for its users. To guide the development of such a model, a robust and comprehensive evaluation system is crucial. To this end, this paper proposes \textit{Video-Bench}, a new comprehensive benchmark, along with a toolkit, specifically designed for evaluating Video-LLMs. The benchmark comprises 10 meticulously crafted tasks that evaluate the capabilities of Video-LLMs at three distinct levels: Video-exclusive Understanding, Prior Knowledge-based Question-Answering, and Comprehension and Decision-making. In addition, we introduce an automatic toolkit that processes model outputs for the various tasks, computes the corresponding metrics, and produces final scores. We evaluate 8 representative Video-LLMs using \textit{Video-Bench}. The findings reveal that current Video-LLMs still fall considerably short of human-like comprehension and analysis of real-world videos, and they offer valuable insights for future research directions. The benchmark and toolkit are available at: \url{https://github.com/PKU-YuanGroup/Video-Bench}.