Building a general-purpose whole-body controller is essential for enabling diverse motion capabilities in humanoid robots across a wide range of downstream tasks, including locomotion and loco-manipulation. Different tasks rely on distinct motion reference modalities: locomotion primarily depends on coordinated robot joint trajectories, whereas manipulation requires precise end-effector trajectory tracking. Existing methods often overlook the representational mismatch between dense robot joint angles and sparse end-effector poses. To address this, we propose Multi-Modal Mimic (M3imic), a versatile multi-modal whole-body control framework that unifies heterogeneous motion reference modalities, including robot joint angles, human pose trajectories, and end-effector poses, by using modality-specific encoders to map them into a shared latent space. Using large-scale reinforcement learning in simulation, we train a single policy that achieves sim-to-real transfer across multiple motion reference modalities without modality-specific retraining. We evaluate the framework through extensive simulation and real-world experiments on the Unitree G1 robot. In simulation, the policy achieves a peak success rate of 98.42% on an unseen test dataset, demonstrating strong generalization. The code is available at https://github.com/Renforce-Dynamics/MultiModalWBC
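As a rough illustration of what "modality-specific encoders mapping into a shared latent space" can look like, the sketch below routes each reference modality through its own small encoder so that every modality yields a latent vector of the same size. It is not taken from the released code: the names `ModalityEncoder`, `build_encoders`, and `encode_reference`, the input dimensions, the hidden size, and the latent size are all hypothetical, and the weights are random rather than trained.

```python
import math
import random

LATENT_DIM = 8  # assumed size of the shared latent space (illustrative only)

# Assumed input sizes per reference modality; none of these come from the paper.
MODALITY_DIMS = {
    "joint_angles": 29,     # one value per actuated joint (assumption)
    "human_pose": 24 * 3,   # 24 keypoints x 3D position (assumption)
    "end_effector": 2 * 7,  # two hands x (position + quaternion) (assumption)
}


def make_linear(in_dim, out_dim, rng):
    """Return a random out_dim x in_dim weight matrix as nested lists."""
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, x):
    """Multiply a weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class ModalityEncoder:
    """Two-layer encoder (linear, tanh, linear) for a single modality."""

    def __init__(self, in_dim, hidden_dim, latent_dim, rng):
        self.in_dim = in_dim
        self.w1 = make_linear(in_dim, hidden_dim, rng)
        self.w2 = make_linear(hidden_dim, latent_dim, rng)

    def __call__(self, x):
        if len(x) != self.in_dim:
            raise ValueError(f"expected {self.in_dim} values, got {len(x)}")
        hidden = [math.tanh(h) for h in apply_linear(self.w1, x)]
        return apply_linear(self.w2, hidden)


def build_encoders(seed=0):
    """Create one encoder per modality; all share the same output size."""
    rng = random.Random(seed)
    return {
        name: ModalityEncoder(dim, 16, LATENT_DIM, rng)
        for name, dim in MODALITY_DIMS.items()
    }


def encode_reference(encoders, modality, reference):
    """Route a reference through its own encoder into the shared latent space."""
    return encoders[modality](list(reference))
```

Because every encoder emits a vector of length `LATENT_DIM`, a single downstream policy could in principle consume the latent regardless of which modality produced it. How the actual framework aligns the latents across modalities is not specified in the abstract and is not modeled here.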