Offline reinforcement learning (RL) aims to learn policies from static datasets of previously collected trajectories. Existing offline RL methods either constrain the learned policy to the support of the offline data or use model-based virtual environments to generate simulated rollouts. However, these methods suffer from (i) poor generalization to unseen states and (ii) marginal improvement from low-quality simulated rollouts. In this paper, we propose offline trajectory generalization through World Transformers for offline reinforcement learning (OTTO). Specifically, we use causal Transformers, which we call World Transformers, to predict state dynamics and the immediate reward. We then propose four strategies that use World Transformers to generate high-reward simulated trajectories by perturbing the offline data. Finally, we train an offline RL algorithm jointly on the offline data and the simulated data. OTTO serves as a plug-in module that can be integrated with existing offline RL methods, enhancing them with the stronger generalization capability of Transformers and with high-reward data augmentation. In extensive experiments on D4RL benchmark datasets, OTTO significantly outperforms state-of-the-art offline RL methods.
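The pipeline described above (perturb offline data, roll out with a learned world model, keep high-reward simulations, then mix them with the offline data) might be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the paper's implementation: `toy_world_model` is a hand-written stand-in for a trained World Transformer, Gaussian noise on the initial state is just one conceivable perturbation and is not claimed to be one of the four strategies, and all names and parameters (`perturb`, `generate_rollout`, `augment`, `keep_frac`, `scale`) are hypothetical.

```python
import random


def toy_world_model(state, action):
    """Hypothetical stand-in for a learned World Transformer.

    Returns (next_state, reward). A real model would be a trained
    causal Transformer; this toy rule exists only so the sketch runs.
    """
    next_state = [s + 0.1 * a for s, a in zip(state, action)]
    reward = -sum(x * x for x in next_state)  # toy reward: nearer the origin is better
    return next_state, reward


def perturb(state, scale, rng):
    """Assumed perturbation: add Gaussian noise to each state dimension."""
    return [s + rng.gauss(0.0, scale) for s in state]


def generate_rollout(start_state, policy, world_model, horizon):
    """Roll the world model forward for `horizon` steps from `start_state`."""
    state, trajectory, total_reward = start_state, [], 0.0
    for _ in range(horizon):
        action = policy(state)
        next_state, reward = world_model(state, action)
        trajectory.append((state, action, reward, next_state))
        total_reward += reward
        state = next_state
    return trajectory, total_reward


def augment(offline_trajs, policy, world_model,
            horizon=5, scale=0.05, keep_frac=0.5, seed=0):
    """Return the offline trajectories plus the highest-reward simulated ones."""
    rng = random.Random(seed)
    candidates = []
    for traj in offline_trajs:
        start = perturb(traj[0][0], scale, rng)  # perturb the first observed state
        candidates.append(generate_rollout(start, policy, world_model, horizon))
    # Keep only the top fraction of simulated rollouts by total reward.
    candidates.sort(key=lambda c: c[1], reverse=True)
    n_keep = max(1, int(len(candidates) * keep_frac))
    simulated = [traj for traj, _ in candidates[:n_keep]]
    return offline_trajs + simulated


# Usage with two one-step offline trajectories of (state, action, reward, next_state).
offline = [
    [([1.0, -1.0], [0.0, 0.0], -2.0, [1.0, -1.0])],
    [([0.5, 0.5], [0.0, 0.0], -0.5, [0.5, 0.5])],
]
policy = lambda s: [-x for x in s]  # hypothetical behavior policy: move toward the origin
mixed = augment(offline, policy, toy_world_model)
```

The resulting `mixed` list could then be handed to any offline RL learner, which is the sense in which such an augmentation step acts as a plug-in; the filtering rule and the rollout policy here are assumptions chosen for brevity.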