DreamPolicy: A Unified World-model Policy for Scalable Humanoid Locomotion

Achieving versatile humanoid locomotion with a single policy presents a critical scalability challenge. Prevailing methods often distill multiple terrain-specific teacher policies into one unified student policy. Such distillation captures basic locomotion primitives, but it struggles to compose these skills to adapt to complex environments, and so generalizes poorly to novel composite terrains unseen during training. To overcome this, we present DreamPolicy, a unified framework that integrates offline data with a diffusion-based world model, enabling a single policy to master both known and unseen terrains. Central to our approach is a terrain-aware, autoregressive diffusion world model trained on aggregated rollouts from specialized policies. This model synthesizes physically plausible future trajectories, which serve as dynamic objectives for a conditioned policy, thereby bypassing manual reward engineering. Unlike distillation, our world model captures generalizable locomotion skills, allowing robust zero-shot transfer to unseen composite terrains. DreamPolicy also scales naturally with data availability: as the offline dataset expands, the diffusion world model continuously acquires richer skills. Experiments demonstrate that DreamPolicy outperforms the strongest baseline by up to 27% on unseen terrains and 38% on combined terrains. By unifying world-model-based planning and policy learning, DreamPolicy breaks the "one task, one policy" bottleneck and establishes a scalable, data-driven paradigm for generalist humanoid control.
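The control loop described above (a world model imagines future states autoregressively, and a policy treats them as goals) can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_denoiser`, `imagine_trajectory`, `goal_conditioned_policy`, the dimensions, and the placeholder dynamics are all hypothetical stand-ins for learned networks, and the simplified denoising loop omits the noise schedule a real diffusion sampler would use.

```python
import numpy as np

# All sizes below are hypothetical, chosen only for illustration.
STATE_DIM = 4      # size of the robot state vector
TERRAIN_DIM = 3    # size of the terrain encoding
HORIZON = 5        # number of future states to imagine
DENOISE_STEPS = 8  # denoising iterations per imagined state


def toy_denoiser(noisy_next, state, terrain, step):
    """Stand-in for a learned denoising network (assumption, not the paper's model).

    Pulls the noisy sample toward a placeholder 'clean' next state that
    depends on the current state and the terrain encoding.
    """
    target = state + 0.1 * terrain.mean()  # placeholder dynamics
    alpha = 1.0 / (step + 1)               # full correction at the final step
    return noisy_next + alpha * (target - noisy_next)


def imagine_trajectory(state, terrain, rng):
    """Autoregressive rollout: each denoised state conditions the next one."""
    trajectory = []
    current = state
    for _ in range(HORIZON):
        x = rng.standard_normal(STATE_DIM)  # start from pure noise
        for step in reversed(range(DENOISE_STEPS)):
            x = toy_denoiser(x, current, terrain, step)
        trajectory.append(x)
        current = x  # feed the prediction back in as the new condition
    return np.stack(trajectory)


def goal_conditioned_policy(state, goal_trajectory, gain=2.0):
    """Toy policy: proportional action toward the first imagined state."""
    return gain * (goal_trajectory[0] - state)


rng = np.random.default_rng(0)
state = np.zeros(STATE_DIM)
terrain = np.ones(TERRAIN_DIM)
plan = imagine_trajectory(state, terrain, rng)
action = goal_conditioned_policy(state, plan)
print(plan.shape, action.shape)
```

In this sketch the imagined trajectory replaces a hand-written reward: the policy only needs to track the states the world model proposes, which is the role the abstract assigns to the synthesized trajectories.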
