Streaming video understanding models must answer queries at any moment during an ongoing stream, using only what they have observed so far and under fixed memory and computation budgets. Existing methods address this by adding memory banks, retrieval modules, or visual token compression to preserve long-range history. However, strong recent-window baselines show that indiscriminate history injection can dilute current-scene perception, suggesting that the key challenge is not whether to use memory, but how to allocate it selectively. We formulate this as budgeted online latent evidence allocation and propose \textbf{SelectStream}, a selective latent-memory framework that keeps the current observation directly visible to a frozen VLM while exposing historical information only through a compact, query-conditioned evidence budget. Three coordinated mechanisms govern when to write, what to preserve, and how to retrieve: surprise-driven adaptive windowing, priority-preserving consolidation, and query-conditioned graph reasoning over a fixed-capacity latent memory graph. Retrieved evidence is calibrated and injected as latent tokens for answer generation, without replaying frames or growing the context with stream length. Experimental results show that SelectStream achieves strong online streaming performance and preserves general video understanding, reaching 82.67\% on StreamingBench, 67.03\% on OVO-Bench, and 74.4\% average accuracy on offline video benchmarks, while outperforming strong recent-window baselines and prior streaming memory methods.
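To make the "when to write, what to preserve, how to retrieve" division concrete, the following is a minimal toy sketch of a fixed-capacity memory with a bounded evidence budget. It is not the SelectStream implementation. Every name and parameter here (`BudgetedMemory`, `write_threshold`, `capacity`, `budget`) is hypothetical, surprise is approximated as one minus the cosine similarity to the last stored entry, and plain cosine top-k ranking stands in for the query-conditioned graph reasoning, which the abstract does not specify in enough detail to reproduce.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class BudgetedMemory:
    """Toy fixed-capacity memory; all fields are illustrative assumptions."""
    capacity: int            # hypothetical: maximum number of stored entries
    write_threshold: float   # hypothetical: minimum surprise required to write
    entries: list = field(default_factory=list)  # (priority, step, vector)

    def observe(self, step, vec):
        """Gate a write on surprise; return True if vec passed the write gate."""
        if self.entries:
            # Assumed surprise proxy: dissimilarity to the last stored entry.
            surprise = 1.0 - cosine(vec, self.entries[-1][2])
        else:
            surprise = 1.0
        if surprise < self.write_threshold:
            return False
        self.entries.append((surprise, step, vec))
        if len(self.entries) > self.capacity:
            # Keep high-priority entries: evict the lowest-priority one,
            # breaking ties in favor of evicting the oldest step.
            victim = min(self.entries, key=lambda e: (e[0], e[1]))
            self.entries.remove(victim)
        return True

    def retrieve(self, query, budget):
        """Return at most `budget` (step, vector) pairs most similar to the query."""
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[2]), reverse=True)
        return [(step, vec) for _, step, vec in ranked[:budget]]


mem = BudgetedMemory(capacity=2, write_threshold=0.1)
for t, v in enumerate([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]):
    mem.observe(t, v)
evidence = mem.retrieve([0.0, 1.0], budget=1)
```

In this toy, the number of stored entries never exceeds `capacity` and the number of retrieved items never exceeds `budget`, regardless of how many observations arrive, which mirrors the abstract's requirement that context does not grow with stream length.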