Memory has become an increasingly important component of agentic systems, as these systems are expected to reason over long-term experience. However, prior work has largely focused on unimodal memory, leaving multimodal memory relatively underexplored despite its central role in real-world applications. Compared with unimodal settings, multimodal memory introduces additional challenges, including heterogeneous input integration, person-centric information alignment, and evidence aggregation across different granularities. We present PyraVid, a hierarchical multimodal memory framework inspired by Event Segmentation Theory from cognitive science. PyraVid organizes long videos into a coarse-to-fine pyramid structure, enabling structured memory access and effective evidence aggregation. It further supports structure-guided memory expansion with pruning, allowing the retrieval of related events with strong causal connectivity but low semantic similarity while reducing noise. Experiments on multiple long-video understanding benchmarks show that PyraVid consistently improves performance across datasets, model scales, and question types, highlighting the effectiveness of hierarchical multimodal memory for long-horizon reasoning.
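To make the two mechanisms named in the abstract more concrete, the following is a minimal toy sketch of coarse-to-fine retrieval over a pyramid of event summaries, followed by one-hop structure-guided expansion with pruning. It is not PyraVid's implementation: the `Node` class, the `links` field with its strength values, the `beam` and `min_strength` parameters, and the word-overlap `similarity` function (a stand-in for a learned embedding similarity) are all assumptions introduced for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One memory entry; `children` hold finer-grained segments of the same span."""
    name: str
    summary: str
    children: list["Node"] = field(default_factory=list)
    # Hypothetical structural links: (target node name, link strength in [0, 1]).
    links: list[tuple[str, float]] = field(default_factory=list)


def similarity(query: str, text: str) -> float:
    """Jaccard word overlap, a toy stand-in for an embedding similarity."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def build_index(root: Node) -> dict[str, Node]:
    """Map every node name in the pyramid to its node."""
    index, stack = {}, [root]
    while stack:
        node = stack.pop()
        index[node.name] = node
        stack.extend(node.children)
    return index


def retrieve(root: Node, query: str, beam: int = 1) -> list[Node]:
    """Coarse-to-fine search: keep the `beam` best nodes per level, return leaves."""
    frontier = [root]
    while any(n.children for n in frontier):
        # Descend one level; nodes without children stay as candidates.
        candidates = [c for n in frontier for c in (n.children or [n])]
        candidates.sort(key=lambda n: similarity(query, n.summary), reverse=True)
        frontier = candidates[:beam]
    return frontier


def expand(hits: list[Node], index: dict[str, Node],
           min_strength: float = 0.5) -> list[Node]:
    """One-hop expansion along links; links weaker than `min_strength` are pruned."""
    seen = {n.name for n in hits}
    expanded = list(hits)
    for node in hits:
        for target, strength in node.links:
            if strength >= min_strength and target not in seen:
                seen.add(target)
                expanded.append(index[target])
    return expanded


# Toy two-level pyramid: video -> coarse segments -> fine-grained events.
spill = Node("spill", "Alice spills coffee on the laptop",
             links=[("repair", 0.9), ("lunch", 0.2)])
lunch = Node("lunch", "Alice eats lunch with Bob")
repair = Node("repair", "Bob visits the repair shop")
walk = Node("walk", "Bob walks in the park")
morning = Node("morning", "Alice works at her desk with coffee and a laptop",
               children=[spill, lunch])
afternoon = Node("afternoon", "Bob runs errands in town", children=[repair, walk])
root = Node("video", "full day", children=[morning, afternoon])

query = "Alice coffee laptop accident"
hits = retrieve(root, query)
context = expand(hits, build_index(root))
print([n.name for n in context])
```

In this toy example the query shares no words with the `repair` event, so similarity search alone would not surface it; it is reached only through the strong link from `spill`, while the weak link to `lunch` is pruned. Pruning here uses link strength rather than semantic similarity, which is one possible reading of the abstract and not a claim about the actual method.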