Unsupervised pre-training methods that use large and diverse datasets have achieved tremendous success across a range of domains. Recent work has investigated such methods for model-based reinforcement learning (MBRL), but it is limited to domain-specific or simulated data. In this paper, we study the problem of pre-training world models on abundant in-the-wild videos for efficient learning of downstream visual control tasks. In-the-wild videos, however, are complicated by various contextual factors, such as intricate backgrounds and textured appearance, which prevent a world model from extracting shared world knowledge that generalizes well. To tackle this issue, we introduce Contextualized World Models (ContextWM), which explicitly model both context and dynamics to overcome the complexity and diversity of in-the-wild videos and to facilitate knowledge transfer between distinct scenes. Specifically, we extend the latent dynamics model with a context encoder that retains contextual information and strengthens the image decoder, which allows the latent dynamics model to concentrate on essential temporal variations. Our experiments show that in-the-wild video pre-training with ContextWM can significantly improve the sample efficiency of MBRL in various domains, including robotic manipulation, locomotion, and autonomous driving.
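The separation of context and dynamics described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class `ToyContextWM`, the helper functions, all dimensions, and the use of random linear maps in place of learned networks are hypothetical. Taking the first frame as the context frame and omitting actions are also assumptions made only to keep the example short. The sketch shows the data flow only: a context encoder summarizes one frame, a latent state is rolled forward over the frames, and the decoder receives both the latent state and the context features.

```python
import math
import random


def linear(n_in, n_out, rng):
    """Return a random weight matrix with n_out rows and n_in columns."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


class ToyContextWM:
    """Toy stand-in: random linear maps replace the learned networks."""

    def __init__(self, frame_dim, context_dim=4, latent_dim=3, seed=0):
        rng = random.Random(seed)
        self.latent_dim = latent_dim
        self.w_context = linear(frame_dim, context_dim, rng)  # context encoder
        self.w_embed = linear(frame_dim, latent_dim, rng)     # per-frame encoder
        self.w_trans = linear(latent_dim, latent_dim, rng)    # latent transition
        # The decoder sees the latent state concatenated with the context features.
        self.w_decode = linear(latent_dim + context_dim, frame_dim, rng)

    def encode_context(self, frame):
        """Summarize static scene information from a single frame."""
        return [math.tanh(v) for v in matvec(self.w_context, frame)]

    def step(self, latent, frame):
        """Advance the latent state using the previous state and the current frame."""
        pred = matvec(self.w_trans, latent)
        obs = matvec(self.w_embed, frame)
        return [math.tanh(p + o) for p, o in zip(pred, obs)]

    def decode(self, latent, context):
        """Reconstruct a frame from the latent state plus the context features."""
        return matvec(self.w_decode, latent + context)

    def reconstruct(self, video):
        # Assumption: the first frame serves as the context frame.
        context = self.encode_context(video[0])
        latent = [0.0] * self.latent_dim
        recon = []
        for frame in video:
            latent = self.step(latent, frame)
            recon.append(self.decode(latent, context))
        return recon, context


def make_video(num_frames, frame_dim, seed=1):
    """Build a fake video: a list of flattened frames with random pixel values."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(frame_dim)] for _ in range(num_frames)]
```

For example, `recon, context = ToyContextWM(frame_dim=6).reconstruct(make_video(5, 6))` returns one reconstructed frame per input frame together with the context features. Because the decoder receives the context separately, the latent state in this sketch does not need to carry static appearance, which is the intuition the abstract gives for letting the dynamics model focus on temporal variation.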