$Ψ_0$: An Open Foundation Model Towards Universal Humanoid Loco-Manipulation

Songlin Wei, Hongyi Jing, Boqian Li, Zhenyu Zhao, Jiageng Mao, Zhenhao Ni, Sicheng He, Jie Liu, Xiawei Liu, Kaidi Kang, Sheng Zang, Weiduo Yuan, Marco Pavone, Di Huang, Yue Wang

We introduce $Ψ_0$ (Psi-Zero), an open foundation model for challenging humanoid loco-manipulation tasks. Existing approaches often attempt to address this problem by co-training on large and diverse human and humanoid data; we argue that this strategy is suboptimal because of the fundamental kinematic and motion disparities between humans and humanoid robots. As a result, data efficiency and model performance remain unsatisfactory despite the considerable data volume. To address this challenge, $Ψ_0$ decouples the learning process to maximize the utility of heterogeneous data sources. Specifically, we propose a staged training paradigm with different learning objectives. First, we autoregressively pre-train a VLM backbone on large-scale egocentric human videos to acquire generalizable visual-action representations. Then, we post-train a flow-based action expert on high-quality humanoid robot data to learn precise robot joint control. Our research further identifies a critical yet often overlooked data recipe: in contrast to approaches that scale with noisy Internet clips or heterogeneous cross-embodiment robot datasets, we demonstrate that pre-training on high-quality egocentric human manipulation data followed by post-training on domain-specific real-world humanoid trajectories yields superior performance. Extensive real-world experiments demonstrate that $Ψ_0$ achieves the best performance using only about 800 hours of human video data and 30 hours of real-world robot data, outperforming baselines pre-trained on more than 10$\times$ as much data by over 40\% in overall success rate across multiple tasks. We will open-source the entire ecosystem to the community, including a data processing and training pipeline, a humanoid foundation model, and a real-time action inference engine.
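
To make the "flow-based action expert" in the abstract more concrete, the following is a minimal sketch of one common flow-matching formulation: a straight-line path from noise to the target action, a velocity-regression loss, and Euler integration at inference time. The abstract does not state which flow objective, path, or sampler $Ψ_0$ uses, so this formulation is an assumption, and every function name here (`interpolate`, `target_velocity`, `flow_matching_loss`, `euler_sample`) is hypothetical rather than taken from the released code.

```python
import random


def interpolate(noise, action, t):
    """Point on the straight path from noise (t=0) to action (t=1).

    Assumption: a linear interpolation path, as in rectified-flow style
    objectives; the abstract does not specify the path.
    """
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise, action):
    """Velocity of the straight path; it is constant in t."""
    return [a - n for n, a in zip(noise, action)]


def flow_matching_loss(predicted, noise, action):
    """Mean squared error between a predicted velocity and the target."""
    target = target_velocity(noise, action)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(target)


def euler_sample(velocity_fn, noise, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    action = [0.1, -0.2, 0.3, 0.0]           # hypothetical joint targets
    noise = [rng.gauss(0.0, 1.0) for _ in action]

    # One training sample: a random time, the noisy point, and the loss of
    # a (deliberately poor) all-zero velocity prediction.
    t = rng.random()
    x_t = interpolate(noise, action, t)
    print("x_t:", x_t)
    print("loss:", flow_matching_loss([0.0] * len(action), noise, action))

    # Inference with an oracle velocity field for this single pair.
    sample = euler_sample(lambda x, t: target_velocity(noise, action), noise)
    print("sample:", sample)
```

In a real system the velocity would be predicted by a network conditioned on the VLM backbone's features; the oracle field above stands in for that network only so the sampler can be exercised end to end.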
