From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

Source: arXiv. Accepted to ACL 2026 Main. 17 pages, 7 figures, 8 tables. TL;DR: We propose MM-Mem, a cognition-inspired, dual-trace hierarchical memory framework for long-horizon video understanding, grounded in Fuzzy-Trace Theory. It compresses memory adaptively via the Information Bottleneck and uses entropy-driven top-down retrieval to access fine-grained details only when necessary.

While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding because of limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, and text-centric approaches that suffer from detail loss and hallucination caused by aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, an Episodic Stream, and a Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory, we derive a Semantic Information Bottleneck objective and introduce SIB-GRPO to optimize the trade-off between memory compression and task-relevant information retention. At inference time, we design an entropy-driven top-down memory retrieval strategy. Extensive experiments across four benchmarks confirm that MM-Mem achieves state-of-the-art performance on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition-inspired memory organization. Code and associated configurations are publicly available at https://github.com/EliSpectre/MM-Mem.
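
To make the idea of entropy-driven top-down retrieval concrete, the following is a minimal sketch and not the MM-Mem implementation. It assumes that memory is queried from the coarsest level (Symbolic Schema) to the finest (Sensory Buffer), and that descent stops once the Shannon entropy of the model's answer distribution falls to or below a threshold. The names `LEVELS`, `top_down_retrieve`, `answer_dist`, and `threshold`, as well as the stopping rule itself, are hypothetical and chosen only for illustration.

```python
import math
from typing import Any, Callable, Dict, List, Sequence, Tuple


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical level keys, ordered coarse (gist) to fine (verbatim),
# mirroring the three tiers named in the abstract.
LEVELS = ["symbolic_schema", "episodic_stream", "sensory_buffer"]


def top_down_retrieve(
    memory: Dict[str, Any],
    answer_dist: Callable[[List[Any]], Sequence[float]],
    threshold: float = 0.5,
) -> Tuple[List[Any], Sequence[float]]:
    """Add memory levels coarse-to-fine until the answer is confident enough.

    `answer_dist` is an assumed callback that maps the retrieved context
    to a probability distribution over candidate answers.
    """
    context: List[Any] = []
    probs: Sequence[float] = []
    for level in LEVELS:
        context.append(memory[level])
        probs = answer_dist(context)
        # Assumed stopping rule: low entropy means the model is confident,
        # so finer-grained (more expensive) memory is not consulted.
        if entropy(probs) <= threshold:
            break
    return context, probs
```

In this sketch, a question that the gist-level schema answers confidently never touches the Sensory Buffer; only uncertain predictions trigger access to finer-grained memory, which matches the "only when necessary" behavior described in the TL;DR.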
