While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address these failure modes, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management to iteratively update a recursive belief state and significantly outperforms existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.