We introduce LHM-Humanoid, a benchmark and learning framework for long-horizon whole-body humanoid loco-manipulation in diverse, cluttered scenes. In our setting, multiple objects are displaced from their intended locations and may obstruct navigation; a humanoid agent must repeatedly (i) walk to a target object, (ii) pick it up with diverse whole-body postures under balance constraints, (iii) carry it while navigating around obstacles, and (iv) place it at a designated goal, all within a single continuous episode and without any environment reset. The task simultaneously demands cross-scene generalization and unified one-policy control: layouts, obstacle arrangements, object category/mass/shape/color, and object start/goal poses vary substantially even within a single room category, so a single general policy must output actions directly rather than invoke a library of pre-trained skills. Our dataset spans four room types (bedroom, living room, kitchen, and warehouse), comprising 350 diverse scenes/tasks with 79 objects (25 of them movable targets). Since no scene-specific ground-truth motion sequences are provided, we learn goal-conditioned teacher policies via reinforcement learning and distill them into a single end-to-end student policy with DAgger. We further distill this unified policy into a vision-language-action (VLA) model driven by egocentric RGB observations and natural-language instructions. Experiments in Isaac Gym show that LHM-Humanoid substantially outperforms end-to-end RL baselines and prior humanoid loco-manipulation methods on both seen and unseen scenes, exhibiting strong long-horizon robustness and cross-scene generalization.
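For concreteness, the sketch below illustrates the kind of DAgger-style teacher-to-student distillation loop the abstract describes: the student policy is rolled out in simulation while a goal-conditioned RL teacher relabels every visited state with an expert action, and the student regresses onto those labels. This is a minimal illustration under stated assumptions, not the paper's implementation; `StudentPolicy`, `DummyEnv`, `dagger_step`, the `scene_id` lookup, and all dimensions are hypothetical placeholders.

```python
import torch
import torch.nn as nn


class StudentPolicy(nn.Module):
    """Single end-to-end policy mapping observations directly to actions (placeholder MLP)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


class DummyEnv:
    """Hypothetical stand-in for the batched Isaac Gym scenes; emits random observations."""

    def __init__(self, num_envs: int = 4, obs_dim: int = 64):
        self.num_envs, self.obs_dim = num_envs, obs_dim
        self.scene_id = 0  # selects which scene-specific teacher acts as the expert

    def reset(self) -> torch.Tensor:
        return torch.randn(self.num_envs, self.obs_dim)

    def step(self, actions: torch.Tensor) -> torch.Tensor:
        return torch.randn(self.num_envs, self.obs_dim)


def dagger_step(student, teachers, env, optimizer, horizon: int = 8) -> float:
    """One DAgger iteration: roll out the *student*, relabel every visited state
    with the goal-conditioned teacher for the active scene, then regress the
    student's actions onto the teacher's."""
    obs = env.reset()
    obs_buf, act_buf = [], []
    for _ in range(horizon):
        with torch.no_grad():
            expert_act = teachers[env.scene_id](obs)  # expert relabeling
        obs_buf.append(obs)
        act_buf.append(expert_act)
        obs = env.step(student(obs))  # the student's own action drives the rollout
    obs_b, act_b = torch.cat(obs_buf), torch.cat(act_buf)
    loss = nn.functional.mse_loss(student(obs_b), act_b)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    env = DummyEnv()
    student = StudentPolicy(obs_dim=64, act_dim=19)        # 19 is a placeholder DoF count
    teachers = {0: StudentPolicy(obs_dim=64, act_dim=19)}  # frozen RL teacher stand-in
    opt = torch.optim.Adam(student.parameters(), lr=3e-4)
    for it in range(3):
        print(f"iter {it}: distillation loss = {dagger_step(student, teachers, env, opt):.4f}")
```

Rolling out the student (rather than the teacher) is the defining choice in DAgger: it exposes the student to its own induced state distribution, which is what makes the distilled policy robust over long horizons instead of drifting off the teacher's trajectories.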