Streaming video understanding requires models to robustly encode, store, and retrieve information from a continuous video stream to support accurate video question answering (VQA). Existing state-of-the-art approaches rely on key-value caching to accumulate frame-level information over time, but use a limited number of tokens per frame, leading to the loss of fine-grained visual details. In this work, we propose scaling the token budget to enable more granular spatiotemporal understanding and reasoning. First, we find that current methods are ill-equipped to handle dense streams: their feature encoding causes query-frame similarity scores to increase over time, biasing retrieval toward later frames. To address this, we introduce an adaptive selection strategy that reduces token redundancy while preserving local spatiotemporal information. We further propose a training-free retrieval mixture-of-experts that leverages external models to better identify relevant frames. With Qwen2.5-VL-7B as the backbone, our method, MemStream, improves over ReKV by +8.0% on CG-Bench, +8.5% on LVBench, and +2.4% on VideoMME (Long).
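To make the three ingredients above concrete, here is a minimal NumPy sketch of (a) redundancy-aware token selection, (b) query-frame similarity scoring over a cache of per-frame keys, and (c) training-free fusion of scores from multiple retrieval experts. All names (`select_tokens`, `frame_scores`, `fuse_expert_scores`), the cosine-similarity threshold, the greedy pruning strategy, and the z-score fusion rule are illustrative assumptions for exposition, not the actual MemStream implementation.

```python
import numpy as np

def select_tokens(frame_tokens: np.ndarray, budget: int,
                  sim_thresh: float = 0.95) -> np.ndarray:
    """Greedy redundancy pruning (illustrative): keep a token only if its
    cosine similarity to every already-kept token is below sim_thresh,
    so near-duplicate patches are dropped while distinct detail survives."""
    normed = frame_tokens / np.linalg.norm(frame_tokens, axis=1, keepdims=True)
    kept = [0]  # always keep the first token
    for i in range(1, len(frame_tokens)):
        if len(kept) >= budget:
            break
        if np.max(normed[kept] @ normed[i]) < sim_thresh:
            kept.append(i)
    return frame_tokens[kept]

def frame_scores(query_vec: np.ndarray, frame_keys: np.ndarray) -> np.ndarray:
    """Cosine similarity between a query embedding and cached per-frame keys."""
    q = query_vec / np.linalg.norm(query_vec)
    keys = frame_keys / np.linalg.norm(frame_keys, axis=1, keepdims=True)
    return keys @ q

def fuse_expert_scores(score_lists, weights=None) -> np.ndarray:
    """Training-free fusion (illustrative): z-normalize each expert's
    per-frame scores, then take a weighted average, so no single expert's
    scale (or recency bias) dominates the final ranking."""
    if weights is None:
        weights = [1.0 / len(score_lists)] * len(score_lists)
    fused = np.zeros(len(score_lists[0]))
    for w, s in zip(weights, score_lists):
        s = np.asarray(s, dtype=float)
        fused += w * (s - s.mean()) / (s.std() + 1e-8)
    return fused

# Toy usage with random embeddings.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 64))         # one frame's vision tokens
compact = select_tokens(tokens, budget=64)  # pruned token set for the cache
keys = rng.normal(size=(120, 64))           # cached keys, one per frame
query = rng.normal(size=64)                 # question embedding
ext_scores = rng.normal(size=120)           # stand-in for an external scorer
fused = fuse_expert_scores([frame_scores(query, keys), ext_scores])
top_frames = np.argsort(fused)[::-1][:4]    # indices of retrieved frames
```

The per-expert normalization in the fusion step is one simple way to keep a drifting similarity scale (the recency bias described above) from swamping the other experts; the paper's actual retrieval mixture-of-experts may combine scores differently.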