A key challenge in manipulation is learning a policy that can robustly generalize to diverse visual environments. A promising mechanism for learning robust policies is to leverage video generative models, which are pretrained on large-scale datasets of internet videos. In this paper, we propose a visuomotor policy learning framework that fine-tunes a video diffusion model on human demonstrations of a given task. At test time, we generate an example of an execution of the task conditioned on images of a novel scene, and use this synthesized execution directly to control the robot. Our key insight is that using common tools allows us to effortlessly bridge the embodiment gap between the human hand and the robot manipulator. We evaluate our approach on four tasks of increasing complexity and demonstrate that harnessing internet-scale generative models allows the learned policy to achieve a significantly higher degree of generalization than existing behavior cloning approaches.