While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding because of limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into one of two extremes: vision-centric methods, which incur high latency and redundancy through dense visual accumulation, and text-centric approaches, which suffer from detail loss and hallucination caused by aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, an Episodic Stream, and a Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory, we derive a Semantic Information Bottleneck objective and introduce SIB-GRPO to optimize the trade-off between memory compression and task-relevant information retention. At inference time, we design an entropy-driven top-down memory retrieval strategy. Extensive experiments across four benchmarks confirm that MM-Mem achieves state-of-the-art performance on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition-inspired memory organization. Code and associated configurations are publicly available at https://github.com/EliSpectre/MM-Mem.
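The abstract does not spell out how the entropy-driven top-down retrieval operates, so the following is only a minimal sketch of one plausible reading: start from the gist level (Symbolic Schema) and descend toward verbatim traces (Episodic Stream, then Sensory Buffer) only while the answer distribution remains uncertain. All names here (`MemoryNode`, `retrieve`, `answer_fn`, `toy_answer_fn`, `threshold`) are hypothetical and are not taken from the MM-Mem code; the toy scoring function stands in for a real model.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class MemoryNode:
    """One entry in a hypothetical three-level memory pyramid."""
    level: str                      # "schema", "episodic", or "sensory"
    content: str
    children: List["MemoryNode"] = field(default_factory=list)


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def retrieve(
    root: MemoryNode,
    answer_fn: Callable[[List[MemoryNode]], List[float]],
    threshold: float = 0.5,
) -> Tuple[List[MemoryNode], List[float]]:
    """Expand memory top-down until the answer entropy drops to the threshold.

    answer_fn is an assumed interface: it maps the retrieved context to a
    probability distribution over candidate answers.
    """
    context = [root]
    frontier = [root]
    while True:
        probs = answer_fn(context)
        if entropy(probs) <= threshold:
            break                   # confident enough: stop at this level
        next_frontier = [c for node in frontier for c in node.children]
        if not next_frontier:
            break                   # no finer-grained memory left
        context.extend(next_frontier)
        frontier = next_frontier
    return context, probs


def toy_answer_fn(context: List[MemoryNode]) -> List[float]:
    """Toy stand-in for a model: finer memory yields a sharper distribution."""
    levels = {node.level for node in context}
    if "sensory" in levels:
        return [0.97, 0.01, 0.01, 0.01]
    if "episodic" in levels:
        return [0.6, 0.2, 0.1, 0.1]
    return [0.25, 0.25, 0.25, 0.25]
```

In this sketch, a lower `threshold` forces retrieval to descend further into the pyramid, while a higher one lets the gist-level schema answer alone; this is the latency versus detail trade-off that the abstract attributes to vision-centric and text-centric extremes.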