Learning Video-Conditioned Policies for Unseen Manipulation Tasks

The ability of a non-expert user to specify robot commands is critical for building generalist agents capable of solving a large variety of tasks. One convenient way to specify the intended robot goal is through a video of a person demonstrating the target task. While prior work typically aims to imitate human demonstrations performed in robot environments, here we focus on a more realistic and challenging setup with demonstrations recorded in natural and diverse human environments. We propose Video-conditioned Policy learning (ViP), a data-driven approach that maps human demonstrations of previously unseen tasks to robot manipulation skills. To this end, we train our policy to generate appropriate actions given current scene observations and a video of the target task. To encourage generalization to new tasks, we do not restrict training to particular tasks and instead learn our policy from unlabelled robot trajectories and the corresponding robot videos. Both robot and human videos in our framework are represented by video embeddings pre-trained for human action recognition. At test time, we first translate human videos to robot videos in the common video embedding space and then use the resulting embeddings to condition our policies. Notably, our approach enables robot control by human demonstrations in a zero-shot manner, i.e., without using robot trajectories paired with human instructions during training. We validate our approach on a set of challenging multi-task robot manipulation environments and outperform the state of the art. Our method also demonstrates excellent performance in a new challenging zero-shot setup where no paired data is used during training.
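The test-time procedure described above (translate a human video into the robot video space, then condition the policy on the result) can be illustrated with a minimal sketch. The abstract does not state how the translation is performed; the sketch assumes nearest-neighbor retrieval by cosine similarity over a bank of robot video embeddings, which is only one plausible choice. All names (`translate_to_robot_embedding`, `toy_policy`, `robot_bank`) and the toy 3-dimensional embeddings are hypothetical, and `toy_policy` is a hand-written placeholder rather than a learned model.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def translate_to_robot_embedding(human_emb: Vector, robot_embs: List[Vector]) -> Vector:
    """Assumed translation step: pick the robot video embedding that is
    closest to the human video embedding in the shared embedding space."""
    if not robot_embs:
        raise ValueError("robot embedding bank must not be empty")
    return max(robot_embs, key=lambda r: cosine_similarity(human_emb, r))


def toy_policy(observation: Vector, task_emb: Vector) -> List[float]:
    """Placeholder for a learned policy pi(action | observation, video embedding).
    Here the 'action' is simply the difference between the two vectors."""
    return [t - o for o, t in zip(observation, task_emb)]


# Toy bank of robot video embeddings (hypothetical values).
robot_bank = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Embedding of a human demonstration video (hypothetical values).
human_embedding = [0.1, 0.9, 0.2]

# Step 1: translate the human video into the robot video embedding space.
robot_embedding = translate_to_robot_embedding(human_embedding, robot_bank)

# Step 2: condition the policy on the translated embedding.
action = toy_policy([0.0, 0.5, 0.0], robot_embedding)
```

In this sketch no human-robot pairs are needed to build the policy input: the human video only selects a robot embedding at test time, which mirrors the zero-shot setting described in the abstract.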
