Inspired by the recent success of sequence modeling in RL and the use of masked language models for pre-training, we propose RePreM (Representation Pre-training with Masked Model), a masked model for pre-training in RL that trains an encoder combined with transformer blocks to predict the masked states or actions in a trajectory. RePreM is simple yet effective compared with existing representation pre-training methods in RL. By relying on sequence modeling, it avoids algorithmic sophistication (such as data augmentation or estimating multiple models) and produces a representation that captures long-term dynamics well. Empirically, we demonstrate the effectiveness of RePreM on various tasks, including dynamics prediction, transfer learning, and sample-efficient RL with both value-based and actor-critic methods. Moreover, we show that RePreM scales well with dataset size, dataset quality, and encoder size, which indicates its potential for big RL models.
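The masking step that the pre-training objective relies on can be illustrated with a minimal sketch of the data side only. The abstract does not specify the masking ratio, the mask token, or how states and actions are tokenized, so the names `MASK`, `interleave`, and `mask_trajectory`, the 25% ratio, and the choice to keep each masked token's type visible are all assumptions made for illustration. The encoder and transformer blocks that would consume the corrupted sequence are omitted.

```python
import random

MASK = None  # hypothetical placeholder standing in for a masked token


def interleave(states, actions):
    """Flatten a trajectory into [s_0, a_0, s_1, a_1, ...] with type tags."""
    tokens = []
    for s, a in zip(states, actions):
        tokens.append(("state", s))
        tokens.append(("action", a))
    return tokens


def mask_trajectory(tokens, mask_ratio=0.25, rng=None):
    """Hide a random subset of tokens.

    Returns the corrupted sequence and a dict mapping each masked
    position to the original value the model would be asked to predict.
    The mask ratio is an assumed hyperparameter, not taken from the paper.
    """
    if rng is None:
        rng = random.Random()
    n_masked = max(1, round(mask_ratio * len(tokens)))
    masked_idx = sorted(rng.sample(range(len(tokens)), n_masked))
    corrupted = list(tokens)
    targets = {}
    for i in masked_idx:
        kind, value = tokens[i]
        targets[i] = value
        # Keep the token type so a model knows whether to predict a state or an action.
        corrupted[i] = (kind, MASK)
    return corrupted, targets


# Toy trajectory: four 2-D states and four discrete actions.
states = [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
actions = [0, 1, 1, 0]
tokens = interleave(states, actions)
corrupted, targets = mask_trajectory(tokens, mask_ratio=0.25, rng=random.Random(0))
```

In a full pipeline, the corrupted sequence would be embedded by the encoder, passed through the transformer blocks, and trained with a reconstruction loss computed only at the positions listed in `targets`.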