FRoM-W1: Towards General Humanoid Whole-Body Control with Language Instructions

Peng Li, Zihan Zhuang, Yangfan Gao, Yi Dong, Sixian Li, Changhao Jiang, Shihan Dou, Zhiheng Xi, Enyu Zhou, Jixuan Huang, Hui Li, Jingjing Gong, Xingjun Ma, Tao Gui, Zuxuan Wu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Xipeng Qiu

Source: arXiv. Project page: https://openmoss.github.io/FRoM-W1

Humanoid robots are capable of performing various actions such as greeting, dancing and even backflipping. However, these motions are often hard-coded or specifically trained, which limits their versatility. In this work, we present FRoM-W1, an open-source framework designed to achieve general humanoid whole-body motion control using natural language. To universally understand natural language and generate corresponding motions, as well as enable various humanoid robots to stably execute these motions in the physical world under gravity, FRoM-W1 operates in two stages: (a) H-GPT: utilizing massive human data, a large-scale language-driven human whole-body motion generation model is trained to generate diverse natural behaviors. We further leverage the Chain-of-Thought technique to improve the model's generalization in instruction understanding. (b) H-ACT: After retargeting generated human whole-body motions into robot-specific actions, a motion controller that is pretrained and further fine-tuned through reinforcement learning in physical simulation enables humanoid robots to accurately and stably perform corresponding actions. It is then deployed on real robots via a modular simulation-to-reality module. We extensively evaluate FRoM-W1 on Unitree H1 and G1 robots. Results demonstrate superior performance on the HumanML3D-X benchmark for human whole-body motion generation, and our introduced reinforcement learning fine-tuning consistently improves both motion tracking accuracy and task success rates of these humanoid robots. We open-source the entire FRoM-W1 framework and hope it will advance the development of humanoid intelligence.
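The two-stage structure described in the abstract (language to human motion, then retargeting and tracking on the robot) can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in: the class `TwoStagePipeline`, the three callables, and the toy functions are illustrative names, not part of the FRoM-W1 codebase, and the stubs replace the learned H-GPT model and the reinforcement-learning controller with trivial arithmetic so the data flow is visible.

```python
from dataclasses import dataclass
from typing import Callable, List

Frame = List[float]  # one pose as a flat list of joint values (assumed representation)


@dataclass
class TwoStagePipeline:
    """Hypothetical wiring of the two stages; not the authors' API."""
    generate_human_motion: Callable[[str], List[Frame]]  # stage (a): text -> human motion
    retarget: Callable[[Frame], Frame]                   # human pose -> robot joint targets
    track: Callable[[Frame, Frame], Frame]               # stage (b): (state, target) -> next state

    def run(self, instruction: str, initial_state: Frame) -> List[Frame]:
        # Stage (a): produce a human motion sequence from the instruction.
        human_motion = self.generate_human_motion(instruction)
        # Retarget each human frame to the robot's joint space.
        targets = [self.retarget(frame) for frame in human_motion]
        # Stage (b): a controller tracks the targets one step at a time.
        state = initial_state
        trajectory = []
        for target in targets:
            state = self.track(state, target)
            trajectory.append(state)
        return trajectory


def toy_generator(instruction: str) -> List[Frame]:
    # Toy stand-in for a motion generator: ignores the text, returns fixed frames.
    return [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]


def toy_retarget(frame: Frame) -> Frame:
    # Toy stand-in for retargeting: the "robot" has only the first two joints.
    return frame[:2]


def toy_tracker(state: Frame, target: Frame) -> Frame:
    # Toy stand-in for a tracking controller: move halfway toward the target.
    return [s + 0.5 * (t - s) for s, t in zip(state, target)]


pipeline = TwoStagePipeline(toy_generator, toy_retarget, toy_tracker)
trajectory = pipeline.run("wave your right hand", initial_state=[0.0, 0.0])
print(trajectory)
```

Keeping the three components as separate callables mirrors the modular split in the abstract: the generator knows nothing about any particular robot, and only the retargeting and tracking steps would change between platforms such as the H1 and G1.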
