From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

From arXiv. TL;DR: We propose MM-Mem, a cognition-inspired, dual-trace hierarchical memory framework for long-horizon video understanding, grounded in Fuzzy-Trace Theory. It compresses memory adaptively via the Information Bottleneck and uses entropy-driven top-down retrieval to access fine-grained details only when necessary. Comments: 16 pages, 7 figures, 7 tables.

Multimodal large language models show impressive short-term reasoning, but they struggle with long-horizon video understanding because of limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, and text-centric approaches that suffer from detail loss and hallucination through aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, an Episodic Stream, and a Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). To govern the dynamic construction of memory, we derive a Semantic Information Bottleneck (SIB) objective and introduce SIB-GRPO to optimize the trade-off between memory compression and task-relevant information retention. At inference, we design an entropy-driven top-down memory retrieval strategy that first queries the abstract Symbolic Schema and progressively "drills down" to the Episodic Stream and Sensory Buffer when uncertainty is high. Extensive experiments across 4 benchmarks confirm the effectiveness of MM-Mem on both offline and streaming tasks, demonstrating robust generalization and supporting the value of cognition-inspired memory organization. Code is available at https://github.com/EliSpectre/MM-Mem.
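The entropy-driven top-down retrieval described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: only the three level names come from the abstract. The function names, the cumulative-context policy, the fixed entropy threshold, and the toy answer model are all assumptions made for illustration. The idea shown is to answer from the most abstract level first and descend one level only while the answer distribution remains too uncertain.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Level names follow the abstract; their top-down order is an assumption
# based on the pyramid (gist at the top, verbatim traces at the bottom).
LEVELS_TOP_DOWN = ["symbolic_schema", "episodic_stream", "sensory_buffer"]

AnswerFn = Callable[[str, List[str]], Sequence[float]]


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy (in nats) of a discrete distribution over answer options."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_down_retrieve(
    answer_fn: AnswerFn,
    memory: Dict[str, List[str]],
    question: str,
    threshold: float,
) -> Tuple[str, Sequence[float]]:
    """Descend the memory pyramid until the answer entropy is low enough.

    Returns the deepest level consulted and the final answer distribution.
    Context is accumulated across levels (a hypothetical design choice).
    """
    context: List[str] = []
    probs: Sequence[float] = []
    for level in LEVELS_TOP_DOWN:
        context.extend(memory[level])
        probs = answer_fn(question, context)
        if shannon_entropy(probs) <= threshold:
            return level, probs  # confident enough, stop drilling down
    return LEVELS_TOP_DOWN[-1], probs  # all levels exhausted


def toy_answer_fn(question: str, context: List[str]) -> Sequence[float]:
    """Hypothetical stand-in for a model: more context, sharper answer."""
    p_top = min(0.97, 0.25 + 0.3 * len(context))
    rest = (1.0 - p_top) / 3.0
    return [p_top, rest, rest, rest]


# Hypothetical memory contents, one short list of entries per level.
toy_memory = {
    "symbolic_schema": ["gist of the video"],
    "episodic_stream": ["event A", "event B"],
    "sensory_buffer": ["frame-level detail"],
}

level, probs = top_down_retrieve(toy_answer_fn, toy_memory, "What happened?", 0.5)
print(level)
```

In this toy setup the gist alone leaves the answer too uncertain for a threshold of 0.5 nats, so retrieval descends one level and stops at the Episodic Stream without touching the Sensory Buffer. A real system would replace `toy_answer_fn` with the model's own answer distribution, and might use a learned or adaptive threshold.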
