Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in video understanding. However, their effectiveness in real-time streaming scenarios remains limited by the storage cost of historical visual features and by insufficient real-time spatiotemporal reasoning. To address these challenges, we propose StreamForest, a novel architecture specifically designed for streaming video understanding. Central to StreamForest is the Persistent Event Memory Forest, a memory mechanism that adaptively organizes video frames into multiple event-level tree structures. This process is guided by penalty functions based on temporal distance, content similarity, and merge frequency, enabling efficient long-term memory retention under limited computational resources. To enhance real-time perception, we introduce a Fine-grained Spatiotemporal Window, which captures detailed short-term visual cues to improve perception of the current scene. Additionally, we present OnlineIT, an instruction-tuning dataset tailored for streaming video tasks. OnlineIT significantly boosts MLLM performance in both real-time perception and future prediction. To evaluate generalization in practical applications, we introduce ODV-Bench, a new benchmark focused on real-time streaming video understanding in autonomous driving scenarios. Experimental results demonstrate that StreamForest achieves state-of-the-art performance, with accuracies of 77.3% on StreamingBench, 60.5% on OVBench, and 55.6% on OVO-Bench. In particular, even under extreme visual token compression (limited to 1024 tokens), the model retains 96.8% of its average accuracy across eight benchmarks relative to the default setting. These results underscore the robustness, efficiency, and generalizability of StreamForest for streaming video understanding.
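The penalty-guided memory compression described above can be sketched roughly as follows. This is a minimal illustration only, assuming a weighted linear combination of the three penalty terms (temporal distance, content dissimilarity, and merge frequency) and greedy pairwise merging of adjacent nodes; the abstract does not specify StreamForest's actual tree construction, weights, or feature handling, so every function and parameter name here is hypothetical.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-8)

def merge_penalty(t_gap, similarity, merge_count,
                  w_time=1.0, w_sim=1.0, w_freq=0.5):
    """Hypothetical penalty for merging two adjacent event nodes
    (lower is cheaper to merge). The weights and the linear form
    are assumptions, not taken from the paper."""
    # temporally distant, dissimilar, or frequently-merged nodes
    # are penalized, so they tend to stay as separate events
    return w_time * t_gap + w_sim * (1.0 - similarity) + w_freq * merge_count

def compress_events(frames, budget):
    """Greedily merge adjacent frame features until at most `budget`
    event nodes remain. Each input frame is a (timestamp, feature)
    pair; each node tracks how many merges produced it."""
    nodes = [(t, f, 0) for t, f in frames]
    while len(nodes) > budget:
        # score every adjacent pair and merge the cheapest one
        best_i, best_p = 0, float("inf")
        for i in range(len(nodes) - 1):
            (t1, f1, c1), (t2, f2, c2) = nodes[i], nodes[i + 1]
            p = merge_penalty(t2 - t1, cosine_sim(f1, f2), c1 + c2)
            if p < best_p:
                best_i, best_p = i, p
        (t1, f1, c1), (t2, f2, c2) = nodes[best_i], nodes[best_i + 1]
        merged = (0.5 * (t1 + t2),
                  [0.5 * (x + y) for x, y in zip(f1, f2)],
                  c1 + c2 + 1)
        nodes[best_i:best_i + 2] = [merged]
    return nodes
```

Under this sketch, a fixed node budget plays the role of the limited token budget in the abstract: visually similar, temporally close frames collapse into one event node first, preserving distinct events for as long as possible.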