Understanding long videos, ranging from tens of minutes to several hours, presents unique challenges in video comprehension. Despite the increasing importance of long-form video content, existing benchmarks primarily focus on shorter clips. To address this gap, we introduce InfiniBench, a comprehensive benchmark for very long video understanding, which offers: 1) the longest video duration, averaging 76.34 minutes; 2) the largest number of question-answer pairs, 108.2K; 3) diverse questions that examine nine different skills and include both multiple-choice and open-ended questions; and 4) a human-centric design, as the video sources come from movies and daily TV shows, with specific human-level question designs such as Movie Spoiler Questions that require critical thinking and comprehensive understanding. Using InfiniBench, we comprehensively evaluate existing Large Multi-Modality Models (LMMs) on each skill, including the commercial model Gemini 1.5 Flash and open-source models. The evaluation shows that our benchmark poses significant challenges: even the best models, such as Gemini, struggle to perform well, reaching only 42.72% average accuracy and an average score of 2.71 out of 5. We hope this benchmark will stimulate the LMM community towards long video and human-level understanding. Our benchmark can be accessed at https://vision-cair.github.io/InfiniBench/