Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SimpleStream delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception. This suggests that stronger memory, retrieval, or compression modules should not be taken as evidence of progress unless they clearly outperform SimpleStream under the same protocol. We therefore argue that future streaming benchmarks should separate recent-scene perception from long-range memory, so that performance improvements from added complexity can be evaluated more clearly.
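The sliding-window idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `SlidingWindowStream`, the `vlm` callable signature, and the `echo_vlm` stub are all hypothetical, chosen only to show how a buffer of the most recent N frames (here N = 4) could sit in front of an arbitrary off-the-shelf model.

```python
from collections import deque
from typing import Any, Callable, List


class SlidingWindowStream:
    """Hypothetical helper: keep only the most recent `window` frames of a stream."""

    def __init__(self, vlm: Callable[[List[Any], str], str], window: int = 4) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        # `vlm` is an assumed interface: any callable taking (frames, question).
        self.vlm = vlm
        # A deque with maxlen silently drops the oldest frame once it is full.
        self.frames: deque = deque(maxlen=window)

    def push(self, frame: Any) -> None:
        """Add a newly arrived frame to the buffer."""
        self.frames.append(frame)

    def ask(self, question: str) -> str:
        """Query the model with only the frames currently in the window."""
        return self.vlm(list(self.frames), question)


def echo_vlm(frames: List[Any], question: str) -> str:
    """Stand-in for a real VLM call; it only reports which frames it received."""
    return f"{question} -> saw frames {frames}"


stream = SlidingWindowStream(echo_vlm, window=4)
for t in range(10):  # integers stand in for decoded video frames
    stream.push(t)
print(stream.ask("What is happening now?"))
# prints: What is happening now? -> saw frames [6, 7, 8, 9]
```

No memory, retrieval, or compression module appears anywhere in this sketch; everything older than the window is simply discarded, which is the property the baseline relies on.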