Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline

Guo Chen, Lidong Lu, Yicheng Liu, Liangrui Dong, Lidong Zou, Jixin Lv, Zhenquan Li, Xinyi Mao, Baoqi Pei, Shihao Wang, Zhiqi Li, Karan Sapra, Fuxiao Liu, Yin-Dong Zheng, Yifei Huang, Limin Wang, Zhiding Yu, Andrew Tao, Guilin Liu, Tong Lu

While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address this, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management to iteratively update a recursive belief state, significantly outperforming existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.
