Conventionally, memory in end-to-end robotic learning is implemented by feeding a sequence of past observations into the learned policy. However, in complex multi-stage real-world tasks, the robot's memory must represent past events at multiple levels of granularity: from long-term memory that captures abstracted semantic concepts (e.g., a robot cooking dinner should remember which stages of the recipe are already done) to short-term memory that captures recent events and compensates for occlusions (e.g., a robot should remember the object it wants to pick up once its arm occludes it). Our main insight is that an effective memory architecture for long-horizon robotic control should combine multiple modalities to capture these different levels of abstraction. We introduce Multi-Scale Embodied Memory (MEM), an approach for mixed-modal long-horizon memory in robot policies. MEM combines video-based short-horizon memory, compressed via a video encoder, with text-based long-horizon memory. Together, they enable robot policies to perform tasks that span up to fifteen minutes, such as cleaning up a kitchen or preparing a grilled cheese sandwich. Additionally, we find that memory enables MEM policies to intelligently adapt manipulation strategies in-context.
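To make the two memory scales concrete, the following is a minimal illustrative sketch, not the MEM implementation. It assumes that short-horizon memory can be modeled as a sliding window of per-frame feature vectors and that long-horizon memory can be modeled as a plain-text list of completed stages. All names (`ToyMultiScaleMemory`, `toy_video_encoder`, `policy_context`) are hypothetical, and the chunk-averaging function is only a stand-in for a learned video encoder.

```python
from collections import deque


def toy_video_encoder(frames, num_tokens=2):
    """Stand-in for a learned video encoder (assumption, not MEM's encoder).

    Compresses a list of frame feature vectors into at most `num_tokens`
    vectors by averaging contiguous chunks of frames.
    """
    if not frames:
        return []
    chunk = -(-len(frames) // num_tokens)  # ceiling division
    tokens = []
    for start in range(0, len(frames), chunk):
        group = frames[start:start + chunk]
        dim = len(group[0])
        tokens.append([sum(f[i] for f in group) / len(group) for i in range(dim)])
    return tokens


class ToyMultiScaleMemory:
    """Hypothetical container pairing short-horizon video memory with
    long-horizon text memory."""

    def __init__(self, window=4):
        # Short horizon: only the most recent `window` frames are kept.
        self.recent_frames = deque(maxlen=window)
        # Long horizon: abstract, human-readable record of finished stages.
        self.completed_stages = []

    def observe(self, frame_features):
        self.recent_frames.append(list(frame_features))

    def note_stage_done(self, description):
        self.completed_stages.append(description)

    def policy_context(self):
        """Bundle both memory scales into one input for a policy."""
        return {
            "video_tokens": toy_video_encoder(list(self.recent_frames)),
            "text_memory": "Done so far: " + "; ".join(self.completed_stages),
        }


memory = ToyMultiScaleMemory(window=4)
for t in range(6):
    memory.observe([float(t), 1.0])  # fake 2-D frame features
memory.note_stage_done("buttered the bread")
context = memory.policy_context()
```

In this toy setup the oldest frames fall out of the window while the text record persists, which mirrors the division of labor described above: recent visual detail for handling occlusions, and compact semantic state for tracking task progress.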