Memory is a critical component of robotic intelligence, as robots must rely on past observations and actions to accomplish long-horizon tasks in partially observable environments. However, existing robotic memory benchmarks still lack multimodal annotations for memory formation, provide limited task coverage and structural complexity, and remain restricted to simulation without real-world evaluation. We address this gap with RoboMemArena, a large-scale benchmark of 26 tasks, with average trajectory lengths exceeding 1,000 steps per task and 68.9% of subtasks being memory-dependent. The generation pipeline leverages a vision-language model (VLM) to design and compose subtasks, generates full trajectories through atomic functions, and provides memory-related annotations, including subtask instructions and native keyframe annotations, while paired real-world memory tasks support physical evaluation. We further design PrediMem, a dual-system VLA in which a high-level VLM planner manages a memory bank with recent and keyframe buffers and uses a predictive coding head to improve sensitivity to task dynamics. Extensive experiments on RoboMemArena show that PrediMem outperforms all baselines and provides insights into memory management, model architecture, and scaling laws for complex memory systems.
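To make the memory bank described above more concrete, the following is a minimal sketch of a two-buffer memory: a sliding window of recent observations plus a separate buffer of keyframes. It is not PrediMem's implementation. The abstract does not specify buffer capacities, the eviction policy, or how keyframes are selected, so the class name `MemoryBank`, the capacities, the FIFO eviction, and the externally supplied `is_keyframe` flag are all assumptions made for illustration.

```python
from collections import deque
from typing import Any, List, Tuple


class MemoryBank:
    """Hypothetical two-buffer memory: recent window + keyframe buffer."""

    def __init__(self, recent_capacity: int = 4, keyframe_capacity: int = 8):
        # Capacities are illustrative assumptions, not values from the paper.
        self.keyframe_capacity = keyframe_capacity
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.recent: deque = deque(maxlen=recent_capacity)
        self.keyframes: List[Tuple[int, Any]] = []

    def add(self, step: int, observation: Any, is_keyframe: bool = False) -> None:
        """Store an observation; keyframes are also kept in the long-lived buffer."""
        self.recent.append((step, observation))
        if is_keyframe:
            self.keyframes.append((step, observation))
            # Assumed FIFO eviction once the keyframe buffer overflows.
            if len(self.keyframes) > self.keyframe_capacity:
                self.keyframes.pop(0)

    def context(self) -> List[Tuple[int, Any]]:
        """Return keyframes and recent observations, deduplicated, in step order."""
        merged = dict(self.keyframes)
        merged.update(dict(self.recent))
        return sorted(merged.items(), key=lambda item: item[0])


if __name__ == "__main__":
    bank = MemoryBank(recent_capacity=2, keyframe_capacity=2)
    for step in range(6):
        # Pretend steps 0 and 3 were flagged as keyframes by some upstream module.
        bank.add(step, f"obs{step}", is_keyframe=step in (0, 3))
    print(bank.context())
```

In this sketch the recent buffer forgets quickly while keyframes persist, so an observation from early in a long trajectory remains available to a planner only if it was flagged as a keyframe.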