The rapid development of Multimodal Large Language Models (MLLMs) has expanded their capabilities from image comprehension to video understanding. However, most of these MLLMs focus primarily on offline video comprehension, requiring all video frames to be processed before any query can be answered. This stands in marked contrast to the human ability to watch, listen, think, and respond to streaming inputs in real time, highlighting the limitations of current MLLMs. In this paper, we introduce StreamingBench, the first comprehensive benchmark designed to evaluate the streaming video understanding capabilities of MLLMs. StreamingBench assesses three core aspects of streaming video understanding: (1) real-time visual understanding, (2) omni-source understanding, and (3) contextual understanding. The benchmark consists of 18 tasks spanning 900 videos and 4,500 human-curated QA pairs. Each video is paired with five questions presented at different time points to simulate a continuous streaming scenario. We conduct experiments on StreamingBench with 13 open-source and proprietary MLLMs and find that even the most advanced proprietary models, such as Gemini 1.5 Pro and GPT-4o, perform significantly below human-level streaming video understanding. We hope our work can facilitate further advances in MLLMs, empowering them to approach human-level video comprehension and interaction in more realistic scenarios.
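The following is a minimal sketch of the streaming evaluation protocol the abstract describes: each video carries several timestamped questions, and the model may only observe frames up to the moment a question is asked. All names here (the JSON schema, `load_frames`, `query_model`) are illustrative assumptions, not the benchmark's actual interface.

```python
import json
from typing import Dict, List


def load_frames(video_path: str, end_time: float) -> List[bytes]:
    """Decode frames from the start of the video up to end_time seconds.
    Placeholder: a real harness would use a video decoder (e.g. ffmpeg)."""
    raise NotImplementedError


def query_model(frames: List[bytes], question: str, options: List[str]) -> str:
    """Send the visible prefix of the video plus the question to an MLLM and
    return its chosen option. Placeholder for an actual model API call."""
    raise NotImplementedError


def evaluate(benchmark_file: str) -> float:
    """Accuracy over timestamped QA pairs, respecting the streaming constraint."""
    with open(benchmark_file) as f:
        items: List[Dict] = json.load(f)  # assumed schema: one dict per QA pair

    correct = 0
    for item in items:
        # The model only sees the video up to the question's timestamp,
        # simulating a continuous streaming scenario.
        frames = load_frames(item["video"], end_time=item["timestamp"])
        prediction = query_model(frames, item["question"], item["options"])
        correct += int(prediction == item["answer"])
    return correct / len(items)
```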