Unsupervised pre-training methods that use large and diverse datasets have achieved tremendous success across a range of domains. Recent work has investigated such methods for model-based reinforcement learning (MBRL), but it is limited to domain-specific or simulated data. In this paper, we study the problem of pre-training world models on abundant in-the-wild videos for efficient learning of downstream visual control tasks. However, in-the-wild videos are complicated by various contextual factors, such as intricate backgrounds and textured appearances, which hinder a world model from extracting the shared world knowledge it needs to generalize better. To tackle this issue, we introduce Contextualized World Models (ContextWM), which explicitly separate context modeling from dynamics modeling to overcome the complexity and diversity of in-the-wild videos and to facilitate knowledge transfer between distinct scenes. Specifically, we realize a contextualized extension of the latent dynamics model by incorporating a context encoder that retains contextual information and empowers the image decoder, which encourages the latent dynamics model to concentrate on essential temporal variations. Our experiments show that in-the-wild video pre-training with ContextWM can significantly improve the sample efficiency of MBRL in various domains, including robotic manipulation, locomotion, and autonomous driving. Code is available at https://github.com/thuml/ContextWM.
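The separation of context and dynamics described in the abstract can be illustrated with a small toy sketch. It is not the ContextWM implementation: the class name `ToyContextualizedWorldModel`, its method names, the vector sizes, and the use of untrained random linear maps in place of neural networks are all assumptions made for illustration, and the toy omits actions, stochastic latents, and training entirely. It only shows the data flow: a context vector is computed once from a single frame, the latent state is rolled forward without seeing that context, and the decoder receives both.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Apply a bias-free linear map (a stand-in for a neural network)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights; a real model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyContextualizedWorldModel:
    """Hypothetical toy model; all names and sizes are illustrative assumptions."""

    def __init__(self, frame_dim: int = 8, context_dim: int = 4,
                 latent_dim: int = 3, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Context branch: summarizes static appearance from one frame.
        self.context_encoder = random_matrix(context_dim, frame_dim, rng)
        # Dynamics branch: encodes a frame into a latent and steps it forward.
        self.frame_encoder = random_matrix(latent_dim, frame_dim, rng)
        self.transition = random_matrix(latent_dim, latent_dim, rng)
        # Decoder sees the latent state concatenated with the context vector.
        self.decoder = random_matrix(frame_dim, latent_dim + context_dim, rng)

    def encode_context(self, context_frame: Vector) -> Vector:
        return linear(self.context_encoder, context_frame)

    def step(self, latent: Vector) -> Vector:
        # The transition never receives the context vector.
        return linear(self.transition, latent)

    def decode(self, latent: Vector, context: Vector) -> Vector:
        return linear(self.decoder, latent + context)

    def rollout(self, first_frame: Vector, context_frame: Vector,
                horizon: int) -> List[Vector]:
        context = self.encode_context(context_frame)  # computed once, then reused
        latent = linear(self.frame_encoder, first_frame)
        predictions = []
        for _ in range(horizon):
            latent = self.step(latent)
            predictions.append(self.decode(latent, context))
        return predictions


model = ToyContextualizedWorldModel()
frame = [0.1 * i for i in range(8)]
predicted_frames = model.rollout(frame, context_frame=frame, horizon=5)
print(len(predicted_frames), len(predicted_frames[0]))
```

In this toy, changing `context_frame` changes the decoded frames but leaves the latent trajectory untouched, which mirrors the stated intent of letting the latent dynamics concentrate on temporal variation while the context branch carries appearance.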