Continuous episodic memory is a core capability for autonomous agents operating in dynamic, real-world environments, yet current streaming video benchmarks offer limited tools for diagnosing what models remember and for how long. We introduce \egostream, a diagnostic benchmark for evaluating streaming episodic memory in egocentric vision. \egostream organizes 2,250 curated questions along seven cognitive dimensions: detail, spatial, temporal, event, social, causal, and prospective memory. We further introduce the Answer Validity Window (AVW), which specifies the temporal span over which an answer remains valid as the observed scene evolves. The AVW lets us expand the questions into 8,528 recall-conditioned evaluations, enabling controlled testing from instant to ultra-long-term recall while separating genuine model forgetting from natural world-state changes. We establish baseline performance with a unified streaming MLLM framework that compares several state-of-the-art memory-management mechanisms: sliding windows, attention sinks, and KV-cache pruning, merging, and offloading. Experiments on a shared Qwen3-VL backbone reveal that comparable aggregate accuracies mask starkly different memory profiles. For instance, token pruning preserves fine-grained details and temporal structure significantly better than token merging, while quantized offloading rescues ultra-long-term recall. All mechanisms run well below real-time speed (more than 1\,s per frame), and the top-performing methods plateau at about 45\% accuracy, exposing critical gaps in current architectures. \egostream provides the diagnostic testbed needed to close these gaps.
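To illustrate how an Answer Validity Window could turn one question into several recall-conditioned evaluations, the following is a minimal sketch and not the benchmark's actual pipeline. The `Question` fields, the `expand` helper, and the recall-delay buckets and their durations are all hypothetical names and values chosen for illustration. A question is queried at several delays after its evidence appears, and any query time that falls outside the window is dropped, so that a change in the world state is not counted as forgetting.

```python
from dataclasses import dataclass

# Hypothetical recall delays in seconds; the real buckets are an assumption here.
RECALL_DELAYS = {"instant": 0.0, "short": 30.0, "long": 300.0, "ultra_long": 1800.0}


@dataclass(frozen=True)
class Question:
    qid: str
    evidence_time: float  # time (s) at which the answer becomes observable
    valid_until: float    # time (s) at which the Answer Validity Window ends


def expand(question, delays=RECALL_DELAYS):
    """Return (qid, bucket, query_time) for each delay that stays inside the AVW."""
    evaluations = []
    for bucket, delay in delays.items():
        query_time = question.evidence_time + delay
        # Skip queries asked after the answer has stopped being valid.
        if query_time <= question.valid_until:
            evaluations.append((question.qid, bucket, query_time))
    return evaluations


q = Question("q1", evidence_time=100.0, valid_until=450.0)
for item in expand(q):
    print(item)
```

In this toy case the ultra-long delay is discarded because the query would land after the window closes, which leaves three evaluations for the question.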