Training manipulation policies for humanoid robots on diverse data enhances their robustness and generalization across tasks and platforms. However, learning solely from robot demonstrations is labor-intensive: it requires expensive teleoperated data collection that is difficult to scale. This paper investigates egocentric human demonstrations as a more scalable source of cross-embodiment training data for robot learning. We mitigate the embodiment gap between humanoids and humans from both the data and the modeling perspectives. We collect an egocentric, task-oriented dataset (PH2D) that is directly aligned with humanoid manipulation demonstrations. We then train a human-humanoid behavior policy, which we term the Human Action Transformer (HAT). HAT's state-action space is unified across humans and humanoid robots and can be differentiably retargeted to robot actions. Co-trained with smaller-scale robot data, HAT directly models humanoid robots and humans as different embodiments without additional supervision. We show that human data improves both the generalization and robustness of HAT while offering significantly better data collection efficiency. Code and data: https://human-as-robot.github.io/
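The core mechanism described above, a policy head shared across embodiments followed by a differentiable retargeting map, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class and variable names (`HATSketch`, `UNIFIED_DIM`, `ROBOT_DOF`), the dimensions, and the use of a learned linear layer as the retargeting map are all assumptions made for exposition; the paper's actual architecture and retargeting are specified in its code release.

```python
# Minimal sketch (assumed, not the authors' code) of a unified
# human/humanoid state-action space with differentiable retargeting.
import torch
import torch.nn as nn

UNIFIED_DIM = 26  # e.g., wrist poses + hand keypoints for both hands (assumed)
ROBOT_DOF = 14    # e.g., two 7-DoF arms (assumed)

class HATSketch(nn.Module):
    def __init__(self, obs_dim: int = 512):
        super().__init__()
        # Policy backbone predicting actions in the unified space, so human
        # and robot demonstrations supervise the same output head.
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, UNIFIED_DIM),
        )
        # Differentiable retargeting: here a learned linear map from the
        # unified action space to robot joint targets; a real system could
        # use a differentiable kinematic/IK layer instead.
        self.retarget = nn.Linear(UNIFIED_DIM, ROBOT_DOF)

    def forward(self, obs: torch.Tensor, embodiment: str = "robot"):
        unified_action = self.backbone(obs)   # shared across embodiments
        if embodiment == "human":
            return unified_action             # supervise directly on human data
        return self.retarget(unified_action)  # map to robot joint commands

# Co-training sketch: human and robot batches share the backbone, and the
# robot loss also back-propagates through the retargeting map.
policy = HATSketch()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
human_obs, human_act = torch.randn(32, 512), torch.randn(32, UNIFIED_DIM)
robot_obs, robot_act = torch.randn(8, 512), torch.randn(8, ROBOT_DOF)
loss = (nn.functional.mse_loss(policy(human_obs, "human"), human_act)
        + nn.functional.mse_loss(policy(robot_obs, "robot"), robot_act))
opt.zero_grad()
loss.backward()
opt.step()
```

Keeping the backbone's output in a single embodiment-agnostic space is what lets the much larger human dataset supervise the same parameters that drive the robot, with only the final retargeting stage being robot-specific.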