Online video understanding requires models to perform continuous perception and long-range reasoning over potentially infinite visual streams. The fundamental challenge lies in the conflict between the unbounded nature of streaming input and the limited context window of Multimodal Large Language Models (MLLMs). Current methods rely primarily on passive processing, which forces a trade-off between maintaining long-range context and capturing the fine-grained details required for complex tasks. To address this, we introduce EventMemAgent, an active online video agent framework built on a hierarchical memory module. Our framework employs a dual-layer strategy for online videos: short-term memory detects event boundaries and uses event-granular reservoir sampling to dynamically manage streaming video frames within a fixed-length buffer, while long-term memory archives past observations in a structured, event-by-event manner. Furthermore, we integrate a multi-granular perception toolkit for active, iterative evidence capture and employ Agentic Reinforcement Learning (Agentic RL) to internalize reasoning and tool-use strategies end-to-end as the agent's intrinsic capabilities. Experiments show that EventMemAgent achieves competitive results on online video benchmarks. The code will be released at: https://github.com/lingcco/EventMemAgent.
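The short-term memory's event-granular reservoir sampling can be illustrated with a minimal sketch. This is a hypothetical implementation, not the authors' released code: it assumes each detected event keeps its own reservoir (classic Algorithm R) within an even split of a fixed-capacity buffer, and that older events are shrunk first when the buffer overflows.

```python
import random


class EventGranularReservoir:
    """Fixed-capacity frame buffer where each event keeps its own
    uniform reservoir sample. Hypothetical sketch of the idea in the
    abstract; quota and eviction policies are assumptions."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.events = []   # one frame list (reservoir) per detected event
        self.counts = []   # frames observed per event, for Algorithm R
        self.rng = random.Random(seed)

    def new_event(self):
        """Called when the event-boundary detector fires."""
        self.events.append([])
        self.counts.append(0)

    def _quota(self):
        # Split the buffer evenly across the events seen so far.
        return max(1, self.capacity // len(self.events))

    def add_frame(self, frame):
        if not self.events:
            self.new_event()
        reservoir = self.events[-1]
        self.counts[-1] += 1
        n, q = self.counts[-1], self._quota()
        if len(reservoir) < q:
            reservoir.append(frame)
        else:
            # Algorithm R: each of the n frames survives with prob. q/n.
            j = self.rng.randrange(n)
            if j < q:
                reservoir[j] = frame
        self._shrink_old_events()

    def _shrink_old_events(self):
        # Keep the total within capacity by randomly evicting frames
        # from earlier events before touching the current one.
        total = sum(len(e) for e in self.events)
        i = 0
        while total > self.capacity and i < len(self.events) - 1:
            e = self.events[i]
            while len(e) > self._quota() and total > self.capacity:
                e.pop(self.rng.randrange(len(e)))
                total -= 1
            i += 1
```

Under this policy the buffer never exceeds its fixed length, every event retains a uniform subsample of its frames, and a new event boundary automatically rebalances per-event quotas, which matches the abstract's goal of dynamic frame management at event granularity.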