We propose This&That, a robot learning method for communicating, planning, and executing a wide range of tasks. It achieves robot planning for general tasks by leveraging the power of video generative models trained on internet-scale data containing rich physical and semantic context. In this work, we tackle three fundamental challenges in video-based planning: 1) unambiguous task communication with simple human instructions, 2) controllable video generation that respects user intents, and 3) translating visual plans into robot actions. We propose language-gesture conditioning for video generation, which is both simpler and clearer than existing language-only methods, especially in complex and uncertain environments. We then introduce a behavioral cloning design that seamlessly incorporates the video plans. This&That demonstrates state-of-the-art effectiveness on all three challenges and justifies the use of video generation as an intermediate representation for generalizable task planning and execution. Project website: https://cfeng16.github.io/this-and-that/.
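To make the language-gesture conditioning idea concrete, the following is a minimal, purely illustrative sketch of how a language embedding and two gesture points (a "this" click on the object and a "that" click on the target location) might be fused into a conditioning vector for a video denoiser. The abstract does not specify the architecture, so every module name, dimension, and design choice here is an assumption, not the paper's implementation.

```python
# Hypothetical sketch of language-gesture conditioning; all names,
# shapes, and the fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn

class LanguageGestureConditioner(nn.Module):
    """Fuses a pooled language embedding with 2D gesture points into
    a single conditioning vector for a (not shown) video denoiser."""
    def __init__(self, text_dim=512, num_points=2, hidden=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        # Each gesture point is an (x, y) normalized image coordinate.
        self.gesture_proj = nn.Linear(num_points * 2, hidden)
        self.fuse = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
        )

    def forward(self, text_emb, gestures):
        # text_emb: (B, text_dim) pooled instruction features
        # gestures: (B, num_points, 2) normalized pixel coordinates
        t = self.text_proj(text_emb)
        g = self.gesture_proj(gestures.flatten(1))
        return self.fuse(torch.cat([t, g], dim=-1))

# Toy forward pass: a stand-in instruction embedding plus two clicks,
# "this" object at (0.3, 0.6) and "that" place at (0.8, 0.4).
cond = LanguageGestureConditioner()
text_emb = torch.randn(1, 512)  # e.g., a CLIP/T5-style sentence embedding
gestures = torch.tensor([[[0.3, 0.6], [0.8, 0.4]]])
c = cond(text_emb, gestures)
print(c.shape)  # torch.Size([1, 256]); would condition the video model
```

The intuition this sketch captures is the one argued in the abstract: two deictic points disambiguate "put this there" far more cheaply than a long language description, especially when the scene contains several similar objects.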