Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams

Benefiting from advances in large language models and cross-modal alignment, existing multi-modal video understanding methods have achieved strong performance in offline scenarios. However, online video streams, one of the most common media forms in the real world, have seldom received attention. Compared to offline videos, the 'dynamic' nature of online video streams poses challenges for the direct application of existing models and introduces new problems, such as the storage of extremely long-term information and the interaction between continuous visual content and 'asynchronous' user questions. Therefore, in this paper we present Flash-VStream, a video-language model that simulates the human memory mechanism. Our model is able to process extremely long video streams in real time and respond to user queries simultaneously. Compared to existing models, Flash-VStream achieves significant reductions in inference latency and VRAM consumption, both of which are closely tied to understanding online streaming video. In addition, given that existing video understanding benchmarks predominantly concentrate on offline scenarios, we propose VStream-QA, a novel question answering benchmark specifically designed for online video streaming understanding. Comparisons with popular existing methods on the proposed benchmark demonstrate the superiority of our method in this challenging setting. To verify the generalizability of our approach, we further evaluate it on existing video understanding benchmarks and achieve state-of-the-art performance in offline scenarios as well. All code, models, and datasets are available at https://invinciblewyq.github.io/vstream-page/
