Human demonstration videos are a widely available data source for robot learning and an intuitive user interface for expressing desired behavior. However, directly extracting reusable robot manipulation skills from unstructured human videos is challenging because of the large embodiment gap between humans and robots and because the action parameters are unobserved. To bridge this embodiment gap, this paper introduces XSkill, an imitation learning framework that 1) discovers a cross-embodiment representation called skill prototypes purely from unlabeled human and robot manipulation videos, 2) transfers the skill representation to robot actions using a conditional diffusion policy, and finally, 3) composes the learned skills to accomplish unseen tasks specified by a human prompt video. Our experiments in simulation and real-world environments show that the discovered skill prototypes facilitate both skill transfer and composition for unseen tasks, resulting in a more general and scalable imitation learning framework. The benchmark, code, and qualitative results are available at https://xskill.cs.columbia.edu/
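To make the idea of a shared skill representation concrete, the following is a minimal sketch, not XSkill's actual implementation. It assumes that frame embeddings from either embodiment are soft-assigned to a common set of prototype vectors by a softmax over cosine similarities, and that a prompt video is summarized as the ordered sequence of its dominant prototypes. The function names `assign_prototypes` and `skill_plan`, the temperature value, and the toy data are all hypothetical.

```python
import numpy as np

def assign_prototypes(embeddings, prototypes, temperature=0.1):
    """Soft-assign each frame embedding to K shared prototypes.

    Assumption: assignment is a softmax over cosine similarities;
    the paper's actual objective may differ.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    c = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = e @ c.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def skill_plan(probs):
    """Collapse per-frame argmax labels into an ordered list of skills,
    merging consecutive repeats."""
    ids = probs.argmax(axis=1)
    plan = [int(ids[0])]
    for i in ids[1:]:
        if int(i) != plan[-1]:
            plan.append(int(i))
    return plan

# Toy data (hypothetical): 3 prototypes in an 8-D embedding space.
rng = np.random.default_rng(0)
prototypes = np.eye(3, 8)

# A fake "prompt video" whose frames sit near prototype 2, then 0, then 1.
order = [2, 2, 2, 0, 0, 1, 1, 1]
frames = prototypes[order] + 0.01 * rng.normal(size=(len(order), 8))

probs = assign_prototypes(frames, prototypes)
plan = skill_plan(probs)
print(plan)
```

In this sketch the resulting plan is the kind of skill sequence that a skill-conditioned policy would consume one entry at a time; the policy itself (a conditional diffusion policy in the paper) is outside the scope of the example.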