Offline reinforcement learning trains policies from pre-collected datasets of transitions. The resulting policies can serve as an effective initialization for online algorithms, improving sample efficiency and speeding up convergence. However, when such datasets are limited in size and quality, offline pre-training can produce sub-optimal policies and degrade subsequent online reinforcement learning performance. In this paper, we propose a model-based data augmentation strategy that maximizes the benefits of offline pre-training and reduces the amount of data it needs to be effective. Our approach trains a world model of the environment on the offline dataset and uses it to augment states during offline pre-training. We evaluate the approach on a variety of MuJoCo robotic tasks; the results show that it can jump-start online fine-tuning and substantially reduce the required number of environment interactions, in some cases by an order of magnitude.
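As a rough illustration of the idea, the sketch below fits a toy world model on offline transitions and uses it to label perturbed states with predicted next states. Everything in it is an assumption made for illustration, not the method evaluated in the paper: the scalar linear dynamics, the Gaussian state perturbation, and the names `fit_linear_world_model` and `augment_states` are hypothetical, and rewards are omitted for brevity.

```python
import random


def fit_linear_world_model(transitions):
    """Least-squares fit of s' = w_s * s + w_a * a from (s, a, s') triples.

    Toy stand-in for a learned world model (assumption: scalar, linear dynamics).
    """
    ss = sum(s * s for s, a, _ in transitions)
    sa = sum(s * a for s, a, _ in transitions)
    aa = sum(a * a for s, a, _ in transitions)
    sy = sum(s * s2 for s, a, s2 in transitions)
    ay = sum(a * s2 for s, a, s2 in transitions)
    det = ss * aa - sa * sa
    if abs(det) < 1e-12:
        raise ValueError("dataset does not identify the model parameters")
    # Solve the 2x2 normal equations with Cramer's rule.
    w_s = (sy * aa - sa * ay) / det
    w_a = (ss * ay - sa * sy) / det
    return w_s, w_a


def augment_states(transitions, model, noise_std, rng):
    """Create one synthetic transition per real one.

    Each state is perturbed with Gaussian noise (an assumed augmentation),
    the action is kept, and the next state is predicted by the world model.
    """
    w_s, w_a = model
    augmented = []
    for s, a, _ in transitions:
        s_aug = s + rng.gauss(0.0, noise_std)
        augmented.append((s_aug, a, w_s * s_aug + w_a * a))
    return augmented


# Small offline dataset from hypothetical true dynamics s' = 0.9 s + 0.5 a.
rng = random.Random(0)
offline = []
for _ in range(50):
    s, a = rng.uniform(-1, 1), rng.uniform(-1, 1)
    offline.append((s, a, 0.9 * s + 0.5 * a))

model = fit_linear_world_model(offline)
# Pre-train on the union of real and model-generated transitions.
pretraining_data = offline + augment_states(offline, model, noise_std=0.1, rng=rng)
```

In this sketch the augmented set doubles the number of transitions available for offline pre-training; in practice the world model would be a learned function approximator, and the quality of the synthetic transitions would depend on its accuracy near the perturbed states.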