Recent advances in Multimodal Large Language Models (MLLMs) have led to significant improvements in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic investigation of attention, we conceptualize the KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computation upon the arrival of a user query, thereby guaranteeing real-time responses for continuous video stream interactions and achieving a 10$\times$ faster time-to-first-token (TTFT) than the prior state of the art (SOTA). Even when reducing video tokens by up to 68% compared with uniform sampling, HERMES achieves comparable or superior accuracy across all benchmarks, with gains of up to 11.4% on streaming datasets.
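To make the abstract's central idea concrete, the following is a minimal sketch (not the authors' implementation, whose details are not given here) of a KV cache organized as a two-tier hierarchical memory: a fine-grained window of recent video tokens plus a pooled, coarse-grained summary of older tokens, so that answering a query only requires concatenating the stored cache rather than recomputing it. All names (`HierarchicalKVCache`, `fine_window`, `pool_stride`) and the mean-pooling compression rule are illustrative assumptions, not part of HERMES.

```python
import torch


class HierarchicalKVCache:
    """Illustrative two-tier KV memory for streaming video tokens (assumed design,
    not the HERMES implementation): recent tokens are kept at full granularity;
    older tokens are mean-pooled into a compact coarse tier."""

    def __init__(self, fine_window: int = 256, pool_stride: int = 4):
        self.fine_window = fine_window    # recent tokens kept at full resolution
        self.pool_stride = pool_stride    # compression ratio for evicted tokens
        self.fine_k = self.fine_v = None      # (T, D) fine-grained tier
        self.coarse_k = self.coarse_v = None  # (T / pool_stride, D) coarse tier

    @staticmethod
    def _cat(a, b):
        return b if a is None else torch.cat([a, b], dim=0)

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        """Ingest KV entries for a new chunk of video tokens, shape (T, D),
        evicting and pooling any overflow from the fine window."""
        self.fine_k = self._cat(self.fine_k, k)
        self.fine_v = self._cat(self.fine_v, v)
        overflow = self.fine_k.shape[0] - self.fine_window
        if overflow > 0:
            # Round overflow up to a multiple of pool_stride so pooling is exact.
            overflow += (-overflow) % self.pool_stride
            old_k, self.fine_k = self.fine_k[:overflow], self.fine_k[overflow:]
            old_v, self.fine_v = self.fine_v[:overflow], self.fine_v[overflow:]
            # Compress evicted tokens into the coarse tier (lower granularity).
            d = old_k.shape[1]
            self.coarse_k = self._cat(
                self.coarse_k, old_k.reshape(-1, self.pool_stride, d).mean(dim=1))
            self.coarse_v = self._cat(
                self.coarse_v, old_v.reshape(-1, self.pool_stride, d).mean(dim=1))

    def snapshot(self):
        """Compact cache handed to the decoder when a query arrives: only a
        concatenation, no auxiliary computation at query time."""
        ks = [t for t in (self.coarse_k, self.fine_k) if t is not None]
        vs = [t for t in (self.coarse_v, self.fine_v) if t is not None]
        return torch.cat(ks, dim=0), torch.cat(vs, dim=0)


# Usage: simulate a stream of 10 chunks of 64 video tokens with hidden size 128.
cache = HierarchicalKVCache(fine_window=256, pool_stride=4)
for _ in range(10):
    cache.append(torch.randn(64, 128), torch.randn(64, 128))
k, v = cache.snapshot()
print(k.shape)  # torch.Size([352, 128]): 352 cached tokens vs. 640 raw tokens
```

In this toy setting the stored cache holds 352 tokens in place of 640 raw ones (a 45% reduction), and the token budget stays bounded as the stream grows; the abstract's reported 68% reduction and 10$\times$ TTFT speedup refer to HERMES itself, not to this sketch.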