World models play a crucial role in decision-making in embodied environments, enabling cost-free exploration that would otherwise be expensive in the real world. To support effective decision-making, world models must generalize well enough to imagine faithfully in out-of-distribution (OOD) regions and must provide reliable uncertainty estimates for assessing the credibility of simulated experience; both requirements pose significant challenges for prior scalable approaches. This paper introduces WHALE, a framework for learning generalizable world models that consists of two key techniques: behavior-conditioning and retracing-rollout. Behavior-conditioning addresses policy distribution shift, one of the primary sources of world model generalization error, while retracing-rollout enables efficient uncertainty estimation without requiring model ensembles. These techniques are universal and can be combined with any neural network architecture for world model learning. Incorporating both techniques, we present Whale-ST, a scalable spatial-temporal transformer-based world model with enhanced generalizability. We demonstrate the superiority of Whale-ST in simulation tasks by evaluating both value estimation accuracy and video generation fidelity. Additionally, we examine the effectiveness of our uncertainty estimation technique, which enhances model-based policy optimization in fully offline scenarios. Furthermore, we propose Whale-X, a 414M-parameter world model trained on 970K trajectories from the Open X-Embodiment datasets. We show that Whale-X exhibits promising scalability and strong generalizability in real-world manipulation scenarios using minimal demonstrations.