Comprehending extended audiovisual experiences remains challenging for computational systems, particularly the temporal integration and cross-modal association that are fundamental to human episodic memory. We introduce HippoMM, a computational cognitive architecture that maps hippocampal mechanisms onto computational components to address these challenges. Rather than relying on scaling or architectural sophistication, HippoMM implements three integrated components: (i) Episodic Segmentation detects changes in the audiovisual input to split videos into discrete episodes, mirroring dentate gyrus pattern separation; (ii) Memory Consolidation compresses episodes into summaries that preserve key features, analogous to hippocampal memory formation; and (iii) Hierarchical Memory Retrieval first searches semantic summaries and, for cross-modal queries, escalates by expanding a temporal window around seed segments, mimicking CA3 pattern completion. Together, these components form an integrated system that exceeds the sum of its parts. On our HippoVlog benchmark, which tests associative memory, HippoMM achieves state-of-the-art accuracy of 78.2% while running 5x faster than retrieval-augmented baselines. Our results demonstrate that cognitive architectures provide blueprints for next-generation multimodal understanding. The code and benchmark dataset are publicly available at https://github.com/linyueqian/HippoMM.
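The three components described above can be illustrated with a minimal sketch. This is not the HippoMM implementation: it assumes each frame is already encoded as a plain feature vector, uses cosine similarity as a stand-in for change detection and matching, and uses mean pooling as a stand-in for consolidation. The names `segment`, `consolidate`, `retrieve`, and `Episode`, and the parameters `threshold`, `confident`, and `window`, are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Episode:
    start: int      # index of the first frame (inclusive)
    end: int        # index one past the last frame (exclusive)
    summary: list   # consolidated feature vector (assumed: mean of frames)


def segment(frames, threshold=0.8):
    """Episodic segmentation: open a new episode when consecutive
    frames fall below the similarity threshold (assumed criterion)."""
    if not frames:
        return []
    bounds = [0]
    for i in range(1, len(frames)):
        if cosine(frames[i - 1], frames[i]) < threshold:
            bounds.append(i)
    bounds.append(len(frames))
    return list(zip(bounds[:-1], bounds[1:]))


def consolidate(frames, spans):
    """Memory consolidation: compress each episode into one summary vector."""
    episodes = []
    for start, end in spans:
        chunk = frames[start:end]
        mean = [sum(col) / len(chunk) for col in zip(*chunk)]
        episodes.append(Episode(start, end, mean))
    return episodes


def retrieve(query, episodes, confident=0.9, window=1):
    """Hierarchical retrieval: search summaries first; if the best match
    is not confident, expand a temporal window around the seed episode."""
    scores = [cosine(query, ep.summary) for ep in episodes]
    seed = max(range(len(episodes)), key=scores.__getitem__)
    if scores[seed] >= confident:
        return [episodes[seed]]
    lo = max(0, seed - window)
    hi = min(len(episodes), seed + window + 1)
    return episodes[lo:hi]


# Toy input: six 2-D "frame features" forming three visually distinct runs.
frames = [[1, 0], [1, 0.1], [0, 1], [0.1, 1], [1, 1], [1, 0.9]]
episodes = consolidate(frames, segment(frames))
fast_hit = retrieve([0, 1], episodes)                      # summary search suffices
expanded = retrieve([0.5, 1], episodes, confident=0.95)    # falls back to the window
```

In this sketch, the fast path returns a single episode when a summary matches the query closely, and the fallback returns the seed episode together with its temporal neighbours, which loosely mirrors the escalation from summary search to window expansion described in the abstract.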