From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

From arXiv. Accepted to ACL 2026 Main. 17 pages, 7 figures, 8 tables. TL;DR: We propose MM-Mem, a cognition-inspired, dual-trace hierarchical memory framework for long-horizon video understanding, grounded in Fuzzy-Trace Theory. It compresses memory adaptively via the Information Bottleneck and uses entropy-driven top-down retrieval to access fine-grained details only when necessary.

While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding because of limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, and text-centric approaches that suffer from detail loss and hallucination caused by aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, an Episodic Stream, and a Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory, we derive a Semantic Information Bottleneck objective and introduce SIB-GRPO to optimize the trade-off between memory compression and task-relevant information retention. At inference time, we design an entropy-driven top-down memory retrieval strategy that accesses fine-grained details only when necessary. Extensive experiments across four benchmarks confirm that MM-Mem achieves state-of-the-art performance on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition-inspired memory organization. Code and associated configurations are publicly available at https://github.com/EliSpectre/MM-Mem.
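
To make the entropy-driven top-down retrieval idea concrete, the following is a minimal illustrative sketch and not the MM-Mem implementation. It assumes that memory is a plain dictionary keyed by three level names taken from the abstract, that a hypothetical `answer_probs` callable returns a probability distribution over candidate answers for a given context, and that a fixed entropy `threshold` decides when to descend. The function names, the threshold value, and the data layout are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Level names follow the abstract, ordered from gist (coarse) to verbatim (fine).
# Using them as dictionary keys is an assumption of this sketch.
LEVELS = ("symbolic_schema", "episodic_stream", "sensory_buffer")


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy, in nats, of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_down_retrieve(
    memory: Dict[str, List[str]],
    answer_probs: Callable[[List[str]], Sequence[float]],
    threshold: float = 0.5,  # hypothetical value, not taken from the paper
) -> Tuple[List[str], str]:
    """Descend the pyramid only while the answer distribution stays uncertain.

    Returns the accumulated context and the name of the deepest level visited.
    """
    context: List[str] = []
    level = LEVELS[0]
    for level in LEVELS:
        # Add this level's entries to what has been retrieved so far.
        context = context + memory.get(level, [])
        if shannon_entropy(answer_probs(context)) <= threshold:
            break  # confident enough, so finer-grained levels are skipped
    return context, level
```

With a scorer that becomes confident as soon as an episodic entry is in the context, the loop would stop at `episodic_stream` and never read the `sensory_buffer`, which is the "access fine-grained details only when necessary" behavior described above. In a real system the scorer would presumably be the model's own answer distribution; here it is left as a parameter.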
