Deep Reinforcement Learning (RL) has become the leading approach for creating artificial agents in complex environments. Model-based methods, which equip RL agents with world models that predict environment dynamics, are among the most promising directions for improving data efficiency, forming a critical step toward bridging the gap between research and real-world deployment. In particular, world models enhance sample efficiency by enabling learning in imagination: a generative sequence model of the environment is trained in a self-supervised manner and then used to simulate experience. Recently, Masked Generative Modelling has emerged as a more efficient and superior inductive bias for modelling and generating token sequences. Building on the Efficient Stochastic Transformer-based World Models (STORM) architecture, we replace the traditional MLP prior with a Masked Generative Prior (i.e., a MaskGIT prior), yielding GIT-STORM. We evaluate our model on two downstream tasks: reinforcement learning and video prediction. GIT-STORM demonstrates substantial performance gains in RL tasks on the Atari 100k benchmark. Moreover, we apply Transformer-based World Models to continuous-action environments for the first time, addressing a significant gap in prior research. To achieve this, we employ a state mixer function that integrates latent state representations with actions, enabling the model to handle continuous control tasks. We validate this approach through qualitative and quantitative analyses on the DeepMind Control Suite, showcasing the effectiveness of Transformer-based World Models in this new domain. Our results highlight the versatility and efficacy of the MaskGIT dynamics prior, paving the way for more accurate world models and effective RL policies.
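The abstract does not specify the form of the state mixer that fuses latent states with continuous actions. A minimal illustrative sketch, under the assumption that the mixer is concatenation followed by a learned linear projection (a common choice, not necessarily the paper's implementation):

```python
import numpy as np

def state_mixer(latent, action, W, b):
    """Hypothetical state mixer: fuse a latent state with a continuous
    action by concatenating them and applying a learned linear map.
    This is an illustrative assumption; the actual mixer in GIT-STORM
    may differ."""
    x = np.concatenate([latent, action])  # shape: (d_latent + d_action,)
    return W @ x + b                      # mixed representation for the world model

# Toy dimensions: 8-dim latent state, 2-dim continuous action (e.g. a
# DeepMind Control Suite torque command), 8-dim mixed output.
rng = np.random.default_rng(0)
latent = rng.standard_normal(8)
action = np.array([0.5, -0.3])
W = rng.standard_normal((8, 10))  # projects the 10-dim concatenation to 8 dims
b = np.zeros(8)

mixed = state_mixer(latent, action, W, b)
print(mixed.shape)  # (8,)
```

The point of such a mixer is that the Transformer's sequence tokens can then carry both state and action information in a single embedding, which is what allows a discrete-token world model to condition on continuous actions at all.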