If generalist robots are to operate in truly unstructured environments, they need to be able to recognize and reason about novel objects and scenarios that might not be present in their own training data. We propose SuSIE, a method that leverages an image-editing diffusion model to act as a high-level planner by proposing intermediate subgoals that a low-level controller can accomplish. Specifically, we finetune InstructPix2Pix on video data, consisting of both human videos and robot rollouts, such that it outputs hypothetical future "subgoal" observations given the robot's current observation and a language command. We also use the robot data to train a low-level goal-conditioned policy to act as the aforementioned low-level controller. We find that the high-level subgoal predictions can utilize Internet-scale pretraining and visual understanding to guide the low-level goal-conditioned policy, achieving significantly better generalization and precision than conventional language-conditioned policies. We achieve state-of-the-art results on the CALVIN benchmark, and also demonstrate robust generalization on real-world manipulation tasks, beating strong baselines that have access to privileged information or that utilize orders of magnitude more compute and training data. The project website can be found at http://rail-berkeley.github.io/susie.
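The two-level structure described in the abstract, a high-level model that proposes a subgoal observation and a low-level goal-conditioned policy that tries to reach it, can be illustrated with a minimal control-loop sketch. This is not the authors' implementation: every name here (`run_hierarchical_episode`, `propose_subgoal`, `policy`, `step`, `num_subgoals`, `steps_per_subgoal`) is hypothetical, observations are short lists of floats standing in for images, and the toy functions replace the finetuned diffusion model, the learned policy, and the robot environment.

```python
from typing import Callable, List

Obs = List[float]     # stand-in for an image observation (assumption)
Action = List[float]  # stand-in for a robot action (assumption)


def run_hierarchical_episode(
    obs: Obs,
    instruction: str,
    propose_subgoal: Callable[[Obs, str], Obs],
    policy: Callable[[Obs, Obs], Action],
    step: Callable[[Obs, Action], Obs],
    num_subgoals: int = 5,
    steps_per_subgoal: int = 10,
) -> Obs:
    """Alternate between proposing a subgoal and acting toward it."""
    for _ in range(num_subgoals):
        # High level: propose a future observation from the current
        # observation and the language command.
        subgoal = propose_subgoal(obs, instruction)
        # Low level: the goal-conditioned policy acts toward that subgoal
        # for a fixed number of steps (the fixed budget is an assumption).
        for _ in range(steps_per_subgoal):
            action = policy(obs, subgoal)
            obs = step(obs, action)
    return obs


# Toy stand-ins, for illustration only.
TARGETS = {"move right": [1.0, 0.0]}  # hypothetical instruction-to-target map


def toy_propose_subgoal(obs: Obs, instruction: str) -> Obs:
    # Propose the point halfway between the current state and the target.
    target = TARGETS[instruction]
    return [o + 0.5 * (t - o) for o, t in zip(obs, target)]


def toy_policy(obs: Obs, goal: Obs) -> Action:
    # Move toward the goal, with each coordinate clipped to [-0.1, 0.1].
    return [max(-0.1, min(0.1, g - o)) for o, g in zip(obs, goal)]


def toy_step(obs: Obs, action: Action) -> Obs:
    # Trivial dynamics: the action is added to the state.
    return [o + a for o, a in zip(obs, action)]


final_obs = run_hierarchical_episode(
    [0.0, 0.0], "move right", toy_propose_subgoal, toy_policy, toy_step
)
```

In this toy setting each subgoal is reachable within the step budget, so the state moves progressively closer to the target. The point of the sketch is only the division of labor: the low-level policy never sees the language command, only the proposed subgoal.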