ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation

Achieving autonomous and versatile whole-body loco-manipulation remains a central barrier to making humanoids practically useful. Yet existing approaches are fundamentally constrained: retargeted data are often scarce or low-quality; methods struggle to scale to large skill repertoires; and, most importantly, they rely on tracking predefined motion references rather than generating behavior from perception and high-level task specifications. To address these limitations, we propose ULTRA, a unified framework with two key components. First, we introduce a physics-driven neural retargeting algorithm that translates large-scale motion capture to humanoid embodiments while preserving physical plausibility for contact-rich interactions. Second, we learn a unified multimodal controller that supports both dense references and sparse task specifications, with sensing that ranges from accurate motion-capture state to noisy egocentric visual inputs. We distill a universal tracking policy into this controller, compress motor skills into a compact latent space, and apply reinforcement learning finetuning to expand coverage and improve robustness in out-of-distribution scenarios. This enables coordinated whole-body behavior from sparse intent without test-time reference motions. We evaluate ULTRA in simulation and on a real Unitree G1 humanoid. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception, consistently outperforming tracking-only baselines, which cover only a limited set of skills.
