LiveStarPro: Proactive Streaming Video Understanding with Hierarchical Memory for Long-Horizon Streams

Despite the remarkable progress of Video Large Language Models (Video-LLMs), current online architectures still struggle to do three things at once: process continuous video streams, decide autonomously when to respond, and preserve long-horizon contextual memory. These limitations undermine real-time responsiveness and cause severe forgetting over prolonged interactions. In this work, we introduce LiveStarPro, a live streaming assistant designed for proactive video understanding over long-horizon streams. LiveStarPro rests on three complementary components. The first is Streaming Verification Decoding (SVeD), an inference framework that identifies when to respond through single-pass perplexity verification, eliminating the dependency on explicit silence tokens. The second is Streaming Causal Attention Masks (SCAM), a training strategy that enforces incremental video-language alignment over variable-length streams. The third is Tree-Structured Hierarchical Memory (TSHM), a recursive memory architecture that organizes evicted historical information into event chains, enabling efficient retrieval from effectively unbounded video streams. To support comprehensive evaluation under realistic online conditions, we further present OmniStarPro, a large-scale benchmark that spans 15 diverse real-world scenarios and extends to hour-scale streams for assessing long-term recall. Extensive experiments show that LiveStarPro consistently surpasses existing methods, attaining a 28.9% improvement in semantic correctness and an 18.2% reduction in timing error, while its streaming key-value cache yields a 1.58x inference speedup over the same model without caching. The model and code are publicly available at https://github.com/sotayang/LiveStarPro.
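
The abstract does not spell out how SVeD's perplexity verification is computed, so the following is only a minimal sketch of one plausible reading: the previous response is re-scored under the updated stream context, and a new response is triggered when its perplexity rises above a threshold. The function names, the thresholding rule, and the threshold value are all assumptions made for illustration and are not taken from the LiveStarPro code.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def should_respond(prev_response_logprobs: Sequence[float],
                   threshold: float) -> bool:
    """Hypothetical gate (an assumption, not the authors' rule).

    `prev_response_logprobs` are the natural-log probabilities the model
    assigns to the tokens of its most recent response, re-scored in a
    single forward pass after new frames arrive. A high perplexity means
    the old response no longer fits the stream, so a new one is due.
    """
    return perplexity(prev_response_logprobs) > threshold


# Old response still fits the new frames: stay silent.
still_valid = [math.log(0.9)] * 3
# Old response is now unlikely given the new frames: speak.
outdated = [math.log(0.2)] * 3

silent = should_respond(still_valid, threshold=1.5)
speak = should_respond(outdated, threshold=1.5)
```

Under this reading, silence never has to be generated as an explicit token: the decision to stay quiet falls out of a score the model already produces, which is consistent with the abstract's claim of removing the dependency on silence tokens.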
