Benefiting from advances in large language models and cross-modal alignment, existing multi-modal video understanding methods have achieved strong performance in offline scenarios. However, online video streams, one of the most common media forms in the real world, have seldom received attention. Compared to offline videos, the 'dynamic' nature of online video streams makes it difficult to apply existing models directly and introduces new problems, such as the storage of extremely long-term information and the interaction between continuous visual content and 'asynchronous' user questions. Therefore, in this paper we present Flash-VStream, a video-language model that simulates the human memory mechanism. Our model is able to process extremely long video streams in real time and respond to user queries simultaneously. Compared to existing models, Flash-VStream achieves significant reductions in inference latency and VRAM consumption, both of which are critical for understanding online streaming video. In addition, given that existing video understanding benchmarks predominantly concentrate on offline scenarios, we propose VStream-QA, a novel question answering benchmark specifically designed for online video streaming understanding. Comparisons with popular existing methods on the proposed benchmark demonstrate the superiority of our method in this challenging setting. To verify the generalizability of our approach, we further evaluate it on existing video understanding benchmarks and achieve state-of-the-art performance in offline scenarios as well. All code, models, and datasets are available at https://invinciblewyq.github.io/vstream-page/