Multimodal Large Language Models (MLLMs) have made rapid progress in perception, understanding, and reasoning, yet existing benchmarks fall short of evaluating these abilities on continuous, dynamic real-world video streams. Such settings require models to maintain coherent understanding and reasoning as visual scenes evolve over time. **We introduce RTV-Bench, a fine-grained benchmark for real-time video analysis with MLLMs**. It is built on three key principles: multi-timestamp question answering, hierarchical question structures spanning perception and reasoning, and multi-dimensional evaluation of continuous perception, understanding, and reasoning. RTV-Bench comprises 552 diverse videos and 4,608 carefully curated QA pairs covering a wide range of dynamic scenarios. We evaluate a broad range of state-of-the-art MLLMs, including proprietary, open-source offline, and open-source real-time models. Our results show that real-time models generally outperform their offline counterparts but still lag behind leading proprietary systems. While scaling model capacity typically yields performance gains, simply increasing the density of sampled input frames does not consistently improve results. These observations point to inherent limitations of current architectures on long-horizon video streams and underscore the need for models explicitly designed for streaming video processing and analysis.
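To make the multi-timestamp question-answering setup concrete, the sketch below shows one plausible way such a QA record could be represented and scored. The field names, record structure, and scoring function are illustrative assumptions for exposition, not RTV-Bench's actual schema or evaluation code.

```python
from dataclasses import dataclass

# Illustrative sketch only: this is NOT RTV-Bench's actual schema;
# field names and scoring logic are assumptions for exposition.

@dataclass
class TimestampedQA:
    """One question posed at several points in a streaming video.

    In multi-timestamp QA, the same question is asked as the video
    unfolds, and the correct answer may change as the scene evolves.
    """
    question: str
    choices: list[str]
    # (timestamp_seconds, correct_choice) pairs: the ground-truth
    # answer at each point in the stream.
    answers_over_time: list[tuple[float, str]]

def score_stream(qa: TimestampedQA, predict) -> float:
    """Fraction of timestamps at which the model's answer matches
    the ground truth. `predict(question, choices, t)` stands in for
    querying a model on the stream prefix up to time t."""
    correct = sum(
        predict(qa.question, qa.choices, t) == gold
        for t, gold in qa.answers_over_time
    )
    return correct / len(qa.answers_over_time)

# Example: a question whose answer changes as the scene evolves.
qa = TimestampedQA(
    question="How many people are visible in the scene?",
    choices=["one", "two", "three", "four"],
    answers_over_time=[(5.0, "one"), (30.0, "two"), (90.0, "three")],
)

def always_two(question, choices, t):
    # Trivial stand-in model that always answers "two".
    return "two"

print(score_stream(qa, always_two))  # 1 of 3 timestamps correct -> 0.333...
```

Under this reading, a model must track the evolving scene to stay correct across timestamps; answering from a single snapshot caps its per-question score, which is what distinguishes this protocol from conventional single-timestamp video QA.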