Recent progress in multimodal large language models has markedly enhanced the understanding of short videos (typically under one minute), and several evaluation datasets have emerged accordingly. However, these advancements fall short of meeting the demands of real-world applications such as embodied intelligence for long-term decision-making, in-depth movie reviews and discussions, and live sports commentary, all of which require comprehension of long videos spanning several hours. To address this gap, we introduce LVBench, a benchmark specifically designed for long video understanding. Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction. LVBench is designed to challenge multimodal models to demonstrate long-term memory and extended comprehension capabilities. Our extensive evaluations reveal that current multimodal models still underperform on these demanding long video understanding tasks. Through LVBench, we aim to spur the development of more advanced models capable of tackling the complexities of long video comprehension. Our data and code are publicly available at: https://lvbench.github.io.