We tackle the problem of learning complex, general behaviors directly in the real world. We propose an approach that lets robots efficiently learn manipulation skills from only a handful of real-world interaction trajectories collected across many different settings. Inspired by the success of learning from large-scale datasets in computer vision and natural language processing, we believe that to learn efficiently, a robot must be able to leverage internet-scale human video data. Humans interact with the world in many interesting ways, which can allow a robot to build an understanding not only of useful actions and affordances but also of how these actions affect the world for manipulation. Our approach builds a structured, human-centric action space grounded in visual affordances learned from human videos. Further, we train a world model on human videos and fine-tune it on a small amount of robot interaction data without any task supervision. We show that this approach of affordance-space world models enables different robots to learn various manipulation skills in complex settings in under 30 minutes of interaction. Videos can be found at https://human-world-model.github.io