Enabling humanoid robots to follow free-form language commands is critical for seamless human-robot interaction, collaborative task execution, and general-purpose embodied intelligence. While recent advances have improved low-level humanoid locomotion and robot manipulation, language-conditioned whole-body control remains a significant challenge. Existing methods are often limited to simple instructions and sacrifice either motion diversity or physical plausibility. To address this, we introduce Humanoid-LLA, a Large Language Action Model that maps expressive language commands to physically executable whole-body actions for humanoid robots. Our approach integrates three core components: a unified motion vocabulary that aligns human and humanoid motion primitives into a shared discrete space; a vocabulary-directed controller distilled from a privileged policy to ensure physical feasibility; and a physics-informed fine-tuning stage using reinforcement learning with dynamics-aware rewards to enhance robustness and stability. Extensive evaluations in simulation and on real-world Unitree G1 and Booster T1 humanoids show that Humanoid-LLA delivers strong language generalization while maintaining high physical fidelity, outperforming existing language-conditioned controllers in motion naturalness, stability, and execution success rate.
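To make the three-component design described above concrete, the following is a minimal conceptual sketch of the inference path (language command → shared motion tokens → vocabulary-directed controller → joint targets). All class names, dimensions, and interfaces here are illustrative assumptions for exposition, not the authors' released API or the actual Humanoid-LLA architecture.

```python
# Conceptual sketch of a language-to-whole-body-action pipeline in the spirit of
# Humanoid-LLA. Names, shapes, and network sizes are assumptions, not the paper's.
import torch
import torch.nn as nn

VOCAB_SIZE = 512   # assumed size of the shared human/humanoid motion codebook
EMBED_DIM = 256    # assumed motion-token embedding width
NUM_JOINTS = 29    # assumed humanoid DoF count (e.g. roughly a Unitree G1)

class MotionVocabulary(nn.Module):
    """Shared discrete space: maps motion-token ids to primitive embeddings."""
    def __init__(self):
        super().__init__()
        self.codebook = nn.Embedding(VOCAB_SIZE, EMBED_DIM)

    def decode(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Look up the embedding of each discrete motion primitive.
        return self.codebook(token_ids)

class LanguageToMotionTokens(nn.Module):
    """Stand-in for the language-action model: text features -> motion tokens."""
    def __init__(self, text_dim: int = 768):
        super().__init__()
        self.head = nn.Linear(text_dim, VOCAB_SIZE)

    def forward(self, text_features: torch.Tensor) -> torch.Tensor:
        # One motion token per decoding step (greedy decoding for simplicity).
        return self.head(text_features).argmax(dim=-1)

class VocabularyDirectedController(nn.Module):
    """Student controller (distilled from a privileged policy in the paper):
    motion-token embedding + proprioception -> joint position targets."""
    def __init__(self, proprio_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(EMBED_DIM + proprio_dim, 256), nn.ELU(),
            nn.Linear(256, NUM_JOINTS),
        )

    def forward(self, token_embed: torch.Tensor, proprio: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([token_embed, proprio], dim=-1))

# Tiny end-to-end example with random stand-in features.
vocab = MotionVocabulary()
lla = LanguageToMotionTokens()
controller = VocabularyDirectedController()

text_features = torch.randn(10, 768)   # 10 decoding steps of language/context features
proprio = torch.randn(10, 64)          # proprioceptive state at each step
tokens = lla(text_features)            # discrete motion tokens from the shared vocabulary
joint_targets = controller(vocab.decode(tokens), proprio)
print(joint_targets.shape)             # torch.Size([10, 29])
```

In this sketch the physics-informed fine-tuning stage is not shown; it would further train the controller with reinforcement learning and dynamics-aware rewards, as the abstract describes.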