GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors

Tianyi Xie, Haotian Zhang, Jinhyung Park, Zi Wang, Bowen Wen, Jiefeng Li, Xueting Li, Qingwei Ben, Haoyang Weng, Yufei Ye, David Minor, Tingwu Wang, Chenfanfu Jiang, Sanja Fidler, Jan Kautz, Linxi Fan, Yuke Zhu, Zhengyi Luo, Umar Iqbal, Ye Yuan

Source: arXiv. Project page: https://research.nvidia.com/labs/dair/grail/

Scaling humanoid loco-manipulation requires robot-compatible demonstrations across diverse objects, whole-body motions, and scene geometries, but teleoperation and motion capture are difficult to scale because each collection depends on physical setups, instrumented actors, and robot operation. We present GRAIL, a digital generation pipeline that remains fully virtual until deployment: it composes 3D assets, simulator-ready scenes, and priors from video foundation models (VFMs) to synthesize interactions without rebuilding physical environments or teleoperating the robot. Rather than reconstructing unconstrained in-the-wild videos, GRAIL starts from fully specified 3D configurations in which object geometry, camera parameters, metric scale, environment depth, and a robot-proportioned character are known before video generation and reused during reconstruction. This privileged setup better conditions 4D recovery, allowing model-based object tracking, human motion estimation, and interaction-aware optimization to reconstruct metric 4D human-object interaction (HOI) trajectories with reduced depth ambiguity and morphology mismatch. We retarget the recovered motions to a humanoid robot and train complementary task-general trackers: an object-aware latent adaptor for manipulation and a scene-aware tracker for terrain traversal. GRAIL produces over 20,000 sequences spanning pick-up, object manipulation, sitting, and terrain traversal. Using only GRAIL-generated data, we train egocentric visual policies through a sim-to-real pipeline and deploy them on a Unitree G1 humanoid, achieving 84% real-world success on diverse object pick-up and 90% success on stair-climbing.
