Scalable learning of humanoid robots is crucial for their deployment in real-world applications. While traditional approaches primarily rely on reinforcement learning or teleoperation to achieve whole-body control, they are often limited by the lack of diversity in simulated environments and the high cost of demonstration collection. In contrast, human videos are ubiquitous and present an untapped source of semantic and motion information that could significantly enhance the generalization capabilities of humanoid robots. This paper introduces Humanoid-X, a large-scale dataset of over 20 million humanoid robot poses with corresponding text-based motion descriptions, designed to leverage this abundant data. Humanoid-X is curated through a comprehensive pipeline: data mining from the Internet, video caption generation, motion retargeting from humans to humanoid robots, and policy learning for real-world deployment. With Humanoid-X, we further train a large humanoid model, UH-1, which takes text instructions as input and outputs corresponding actions to control a humanoid robot. Extensive simulated and real-world experiments validate that our scalable training approach leads to superior generalization in text-based humanoid control, marking a significant step toward adaptable, real-world-ready humanoid robots.