Offline reinforcement learning trains policies from pre-collected datasets of transitions. Policies trained this way can serve as an effective initialization for online algorithms, improving sample efficiency and speeding up convergence. However, when such datasets are limited in size and quality, offline pre-training can produce sub-optimal policies and degrade subsequent online reinforcement learning performance. In this paper, we propose a model-based data augmentation strategy that maximizes the benefits of offline reinforcement learning pre-training and reduces the amount of data it needs to be effective. Our approach uses a world model of the environment, trained on the offline dataset, to augment states during offline pre-training. We evaluate our approach on a variety of MuJoCo robotic tasks; the results show that it can jump-start online fine-tuning and substantially reduce the required number of environment interactions, in some cases by an order of magnitude.
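The abstract does not specify how the world model is used to augment states, so the following is only a minimal sketch under stated assumptions, not the paper's method. It assumes a linear world model fitted by least squares (a stand-in for whatever learned model the authors use) and Gaussian perturbation of dataset states as the augmentation. The function names `fit_linear_world_model`, `model_step`, and `augment_batch`, the `noise_scale` parameter, and the toy dynamics are all hypothetical and chosen for illustration.

```python
import numpy as np


def fit_linear_world_model(states, actions, next_states, rewards):
    """Least-squares fit of [s, a, 1] -> [s', r].

    Assumption: a linear model stands in for a learned world model.
    """
    inputs = np.hstack([states, actions, np.ones((len(states), 1))])
    targets = np.hstack([next_states, rewards[:, None]])
    weights, *_ = np.linalg.lstsq(inputs, targets, rcond=None)
    return weights


def model_step(weights, states, actions):
    """Predict next states and rewards with the fitted model."""
    inputs = np.hstack([states, actions, np.ones((len(states), 1))])
    preds = inputs @ weights
    return preds[:, :-1], preds[:, -1]


def augment_batch(weights, states, actions, noise_scale, rng):
    """Perturb dataset states, then relabel next state and reward with the model.

    Assumption: Gaussian state noise is one possible augmentation; the
    source does not describe the actual scheme.
    """
    aug_states = states + noise_scale * rng.standard_normal(states.shape)
    aug_next_states, aug_rewards = model_step(weights, aug_states, actions)
    return aug_states, actions, aug_rewards, aug_next_states


# Toy offline dataset generated from known linear dynamics (illustrative only).
rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.95]])
B = np.array([[0.0], [0.5]])
states = rng.standard_normal((256, 2))
actions = rng.uniform(-1.0, 1.0, size=(256, 1))
next_states = states @ A.T + actions @ B.T
rewards = states @ np.array([1.0, -0.5]) + 0.1 * actions[:, 0]

weights = fit_linear_world_model(states, actions, next_states, rewards)
aug_s, aug_a, aug_r, aug_next = augment_batch(
    weights, states, actions, noise_scale=0.05, rng=rng
)

# Mix real and model-generated transitions for offline pre-training.
train_states = np.vstack([states, aug_s])
train_actions = np.vstack([actions, aug_a])
train_rewards = np.concatenate([rewards, aug_r])
train_next_states = np.vstack([next_states, aug_next])
```

In this sketch the augmented transitions double the size of the training set; in practice the ratio of real to synthetic data and the perturbation scale would be tuning choices, and a model that is inaccurate far from the data could make large perturbations harmful.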