Humanoid whole-body control has made significant progress in recent years, yet existing approaches remain limited to few-skill policies that require heavy reward engineering, or to motion trackers that are difficult to extend to new input modalities. We argue that the key to general-purpose humanoid control is to build a scalable brain, a module capable of reasoning over diverse conditioning modalities, atop a reactive motion-tracking cerebellum, mirroring the hierarchical structure of biological motor systems. Two challenges arise in realizing this vision: acquiring a vast amount of high-quality data to achieve general-purpose control, and equipping the generator with the capability to condition on compositional, extensible multi-modal inputs. We present OMG, which addresses these challenges with a meticulous data curation, filtering, and labeling pipeline, together with a diffusion-based motion generation backbone that conditions on language, audio, and human reference motions. Extensive experiments validate OMG as an omni-modal whole-body controller that exhibits state-of-the-art performance, model scaling behavior, and efficient adaptation to new distributions and modalities, marking a concrete step toward foundation models for humanoid robots.