We explore how intermediate policy representations can facilitate generalization by providing guidance on how to perform manipulation tasks. Existing representations such as language, goal images, and trajectory sketches have been shown to be helpful, but they either provide insufficient context or over-specified context that yields less robust policies. We propose conditioning policies on affordances, which capture the pose of the robot at key stages of the task. Affordances offer expressive yet lightweight abstractions, are easy for users to specify, and facilitate efficient learning by transferring knowledge from large internet datasets. Our method, RT-Affordance, is a hierarchical model that first proposes an affordance plan given the task language, and then conditions the policy on this affordance plan to perform manipulation. Our model can flexibly bridge heterogeneous sources of supervision, including large web datasets and robot trajectories. We additionally train our model on cheap-to-collect in-domain affordance images, allowing us to learn new tasks without collecting any additional costly robot trajectories. On a diverse set of novel tasks, we show that RT-Affordance exceeds the performance of existing methods by over 50%, and we empirically demonstrate that affordances are robust to novel settings. Videos are available at https://snasiriany.me/rt-affordance
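To make the hierarchical structure concrete, here is a minimal sketch of the two-stage inference pipeline the abstract describes: a high-level model maps task language to an affordance plan (key robot poses), and a low-level policy is conditioned on that plan rather than on language alone. All names, the `Affordance` schema, and the stub implementations are hypothetical illustrations, not the paper's actual interface; in the real system both stages are learned models.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class Affordance:
    """End-effector pose at a key stage of the task (hypothetical schema)."""
    stage: str
    position: Tuple[float, float, float]       # (x, y, z) in meters
    orientation: Tuple[float, float, float, float]  # quaternion (w, x, y, z)


def propose_affordance_plan(task_language: str) -> List[Affordance]:
    """Stand-in for the high-level model: task language -> affordance plan.

    A real system would use a learned vision-language model; here we
    return a fixed plan purely to illustrate the data flow.
    """
    return [
        Affordance("grasp", (0.4, 0.0, 0.1), (1.0, 0.0, 0.0, 0.0)),
        Affordance("place", (0.2, 0.3, 0.2), (1.0, 0.0, 0.0, 0.0)),
    ]


def policy_step(observation: Any, plan: List[Affordance]) -> Dict[str, Any]:
    """Stand-in for the low-level policy conditioned on the affordance plan.

    Instead of re-interpreting free-form language at every step, the
    policy receives the plan's intermediate poses as its conditioning
    signal; here it trivially targets the first pending affordance.
    """
    target = plan[0]
    return {"move_to": target.position, "stage": target.stage}


# Hierarchical inference: plan once, then condition the policy on the plan.
plan = propose_affordance_plan("put the cup on the shelf")
action = policy_step(observation=None, plan=plan)
```

The key design point the sketch mirrors is that the affordance plan sits between under-specified language and over-specified full trajectories: it pins down only the poses at key stages, leaving the rest of the motion to the policy.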