Streaming video question answering (Streaming Video QA) poses distinct challenges for multimodal large language models (MLLMs): video frames arrive sequentially, and user queries can be issued at arbitrary time points. Existing solutions that rely on fixed-size memory or naive compression often suffer from context loss or memory overflow, limiting their effectiveness in long-form, real-time scenarios. We present Vista, a novel framework for scene-aware streaming video QA that enables efficient and scalable reasoning over continuous video streams. Vista introduces three key mechanisms: (1) scene-aware segmentation, which dynamically clusters incoming frames into temporally and visually coherent scene units; (2) scene-aware compression, which compresses each scene into a compact token representation stored in GPU memory for efficient index-based retrieval, while offloading full-resolution frames to CPU memory; and (3) scene-aware recall, which, upon receiving a query, selectively recalls relevant scenes and reintegrates them into the model input, achieving both efficiency and completeness. Vista is model-agnostic and integrates seamlessly with a variety of vision-language backbones, enabling long-context reasoning without compromising latency or memory efficiency. Extensive experiments on StreamingBench demonstrate that Vista achieves state-of-the-art performance, establishing a strong baseline for real-world streaming video understanding.
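The segment-compress-recall pipeline described above can be sketched in miniature. This is an illustrative toy, not the paper's implementation: `VistaSketch`, its similarity threshold, and the mean-pooled "token" are all hypothetical stand-ins (real Vista would use visual embeddings from a vision encoder and actual GPU/CPU memory placement), but the control flow mirrors the three mechanisms.

```python
from dataclasses import dataclass

@dataclass
class Scene:
    frames: list  # full-resolution frames (stand-in for CPU-offloaded storage)
    token: list   # compact scene representation (stand-in for the GPU-resident token)

class VistaSketch:
    """Toy sketch of scene-aware segmentation, compression, and recall."""

    def __init__(self, sim_threshold=0.8):
        self.sim_threshold = sim_threshold
        self.scenes = []   # compressed scene store
        self.current = []  # frames of the scene being built

    @staticmethod
    def _feature(frame):
        # Stand-in for a visual embedding: sum-normalized pixel vector.
        s = sum(frame)
        return [v / (s or 1.0) for v in frame]

    @staticmethod
    def _sim(a, b):
        # Cosine similarity between two feature vectors.
        num = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return num / (na * nb or 1.0)

    def ingest(self, frame):
        # (1) Scene-aware segmentation: open a new scene when the incoming
        # frame is visually dissimilar from the current scene's first frame.
        if self.current and self._sim(
            self._feature(frame), self._feature(self.current[0])
        ) < self.sim_threshold:
            self._flush()
        self.current.append(frame)

    def _flush(self):
        # (2) Scene-aware compression: keep one compact token per scene
        # (here, the mean of frame features) and set full frames aside.
        feats = [self._feature(f) for f in self.current]
        token = [sum(col) / len(feats) for col in zip(*feats)]
        self.scenes.append(Scene(frames=self.current, token=token))
        self.current = []

    def recall(self, query_feature, top_k=1):
        # (3) Scene-aware recall: rank scenes by token-query similarity and
        # reattach the full-resolution frames of the best matches.
        if self.current:
            self._flush()
        ranked = sorted(
            self.scenes,
            key=lambda s: self._sim(s.token, query_feature),
            reverse=True,
        )
        return [s.frames for s in ranked[:top_k]]
```

In this sketch, compression cost is paid once per scene at segmentation time, so answering a query reduces to a cheap similarity ranking over compact tokens followed by retrieval of only the recalled scenes' frames.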