RoboForge: Physically Optimized Text-guided Whole-Body Locomotion for Humanoids

While generative models have become effective at producing human-like motions from text, transferring these motions to humanoid robots for physical execution remains challenging. Existing pipelines are often limited by retargeting, where kinematic quality is undermined by physical infeasibility, contact-transition errors, and the high cost of real-world dynamical data. We present a unified latent-driven framework that bridges natural language and whole-body humanoid locomotion through a retarget-free, physics-optimized pipeline. Rather than treating generation and control as separate stages, our key insight is to couple them bidirectionally under physical constraints. We introduce a Physical Plausibility Optimization (PP-Opt) module as the coupling interface. In the forward direction, PP-Opt refines a teacher-student distillation policy with a plausibility-centric reward to suppress artifacts such as floating, skating, and penetration. In the backward direction, it converts reward-optimized simulation rollouts into high-quality explicit motion data, which is used to fine-tune the motion generator toward a more physically plausible latent distribution. This bidirectional design forms a self-improving cycle: the generator learns a physically grounded latent space, while the controller learns to execute latent-conditioned behaviors with dynamical integrity. Extensive experiments on the Unitree G1 humanoid show that our bidirectional optimization improves tracking accuracy and success rates. Across IsaacLab and MuJoCo, the implicit latent-driven pipeline consistently outperforms conventional explicit retargeting baselines in both precision and stability. By coupling diffusion-based motion generation with physical plausibility optimization, our framework provides a practical path toward deployable text-guided humanoid intelligence.
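The abstract does not specify the form of the plausibility-centric reward, so the following is only an illustrative sketch of how the three named artifacts (floating, skating, penetration) could be scored from foot trajectories. The function names, the data layout, the flat-ground assumption, the tolerance, and the weights are all hypothetical and are not taken from PP-Opt.

```python
import math

def plausibility_terms(foot_pos, contact, dt, ground_z=0.0, tol=0.01):
    """Mean floating, skating, and penetration penalties (assumed definitions).

    foot_pos[t][f] is an (x, y, z) position in meters for foot f at step t.
    contact[t][f] is True when foot f is meant to be on the ground at step t.
    """
    floating, skating, penetration = [], [], []
    for t in range(len(foot_pos)):
        for f in range(len(foot_pos[t])):
            x, y, z = foot_pos[t][f]
            height = z - ground_z
            # Penetration: depth of any foot below the (assumed flat) ground.
            penetration.append(max(0.0, -height))
            if contact[t][f]:
                # Floating: a foot flagged as in contact but hovering above ground.
                floating.append(max(0.0, height - tol))
                if t > 0 and contact[t - 1][f]:
                    # Skating: horizontal speed of a foot that stays in contact.
                    px, py, _ = foot_pos[t - 1][f]
                    skating.append(math.hypot(x - px, y - py) / dt)

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(floating), mean(skating), mean(penetration)

def plausibility_reward(foot_pos, contact, dt, weights=(10.0, 1.0, 10.0)):
    """Map the weighted penalties to a reward in (0, 1]; the weights are placeholders."""
    terms = plausibility_terms(foot_pos, contact, dt)
    return math.exp(-sum(w * p for w, p in zip(weights, terms)))
```

Under these assumed definitions, a rollout with stationary feet resting on the ground receives the maximum reward of 1, and any sliding, hovering, or ground penetration lowers it smoothly toward 0.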
