Humans routinely rely on memory to perform tasks, yet most robot policies lack this capability; our goal is to endow robot policies with the same ability. Naively conditioning on long observation histories is computationally expensive and brittle under covariate shift, while indiscriminately subsampling the history retains irrelevant or redundant information. We propose a hierarchical policy framework in which the high-level policy is trained to select and track relevant keyframes from its past experience. The high-level policy uses the selected keyframes together with the most recent frames when generating text instructions for a low-level policy to execute. This design is compatible with existing vision-language-action (VLA) models and enables the system to efficiently reason over long-horizon dependencies. In our experiments, we finetune Qwen2.5-VL-7B-Instruct and $\pi_{0.5}$ as the high-level and low-level policies, respectively, using demonstrations supplemented with minimal language annotations. Our approach, MemER, outperforms prior methods on three real-world long-horizon robotic manipulation tasks that require minutes of memory. Videos and code can be found at https://jen-pan.github.io/memer/.
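The hierarchical loop described above can be sketched minimally as follows. This is an illustrative toy, not the actual MemER implementation: the class names, the `salient` flag, and the string-valued subgoals are all hypothetical stand-ins for the finetuned VLA models and their keyframe-selection mechanism.

```python
from collections import deque

class HighLevelPolicy:
    """Tracks a small set of relevant keyframes and emits a text subgoal.

    In MemER this role is played by a finetuned VLM; here a toy relevance
    rule (a hypothetical `salient` flag on each frame) stands in for it.
    """
    def __init__(self, max_keyframes=4):
        self.keyframes = deque(maxlen=max_keyframes)

    def step(self, frame, recent_frames, task):
        # Keep the frame if the (toy) relevance rule flags it.
        if frame.get("salient"):
            self.keyframes.append(frame)
        # Condition on selected keyframes + most recent frames only,
        # rather than the full observation history.
        context = list(self.keyframes) + recent_frames
        return f"subgoal for '{task}' given {len(context)} context frames"

class LowLevelPolicy:
    """Executes a text instruction (stands in for the low-level VLA)."""
    def act(self, instruction, frame):
        return {"instruction": instruction, "action": "move"}

def run_episode(frames, task, window=2):
    high, low = HighLevelPolicy(), LowLevelPolicy()
    actions = []
    for t, frame in enumerate(frames):
        recent = frames[max(0, t - window + 1): t + 1]
        instruction = high.step(frame, recent, task)
        actions.append(low.act(instruction, frame))
    return actions, list(high.keyframes)
```

The point of the sketch is the interface: the high-level policy never sees the full history, only its selected keyframes plus a short recent window, which is what keeps the context length bounded over long horizons.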