In this work, we present an approach for constructing a video-based robot policy that can reliably execute diverse tasks across different robots and environments from a few video demonstrations, without using any action annotations. Our method leverages images as a task-agnostic representation that encodes both state and action information, and text as a general representation for specifying robot goals. By synthesizing videos that ``hallucinate'' a robot executing actions and combining them with dense correspondences between frames, our approach can infer, in closed form, the actions to execute in an environment without the need for any explicit action labels. This capability allows us to train the policy solely on RGB videos and to deploy the learned policies on various robotic tasks. We demonstrate the efficacy of our approach in learning policies for table-top manipulation and navigation tasks. Additionally, we contribute an open-source framework for efficient video modeling, which enables training high-fidelity policy models with four GPUs within a single day.
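To illustrate how dense correspondences between frames can yield an action in closed form, the following minimal sketch recovers a planar rigid motion (a rotation angle and a translation) from matched 2D points by least squares. It is a simplified 2D illustration under our own assumptions, not the procedure used in this work: the function name `planar_rigid_motion`, its inputs, and the restriction to the plane are all hypothetical.

```python
import math


def planar_rigid_motion(src, dst):
    """Least-squares 2D rotation angle and translation mapping src onto dst.

    Hypothetical helper for illustration only. src and dst are equal-length
    sequences of (x, y) pairs, where dst[i] is the matched location of src[i]
    in the next (synthesized) frame.
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two matched point pairs")
    n = len(src)
    # Centroids of both point sets.
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    # Accumulate dot and cross products of the centered correspondences.
    dot = 0.0
    cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - sx, py - sy
        bx, by = qx - dx, qy - dy
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    # The optimal rotation angle has a closed form; no iteration is needed.
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # The translation aligns the rotated source centroid with the target centroid.
    tx = dx - (c * sx - s * sy)
    ty = dy - (s * sx + c * sy)
    return theta, (tx, ty)
```

Given points tracked on an object in the current frame and their matches in a synthesized future frame, the returned angle and translation could serve as a motion target for a controller. A 3D variant would additionally require depth and camera parameters, which this sketch leaves out.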