Model-based methods provide an effective approach to offline reinforcement learning (RL). They learn an environment dynamics model from interaction experience and then perform policy optimization based on the learned model. However, previous model-based offline RL methods lack long-term prediction capability, resulting in large errors when generating multi-step trajectories. We address this issue by developing a sequence modeling architecture, Environment Transformer, which can generate reliable long-horizon trajectories from offline datasets. We then propose a novel model-based offline RL algorithm, ENTROPY, that learns the dynamics model and reward function with the ENvironment TRansformer and performs Offline PolicY optimization. We evaluate the proposed method on MuJoCo continuous control RL environments. Results show that ENTROPY performs comparably to or better than state-of-the-art model-based and model-free offline RL methods and demonstrates stronger long-term trajectory prediction capability than existing model-based offline methods.
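To make the multi-step error problem concrete, the following is a minimal sketch and not the Environment Transformer itself. It assumes a hypothetical two-dimensional linear environment and a hypothetical one-step "learned" model whose parameters are slightly wrong, then rolls both out from the same start state under the same actions. All names (`A_true`, `A_hat`, `rollout`, the 0.01 perturbation, the horizon of 50) are illustrative assumptions and do not come from the paper.

```python
import numpy as np


def rollout(step_fn, s0, actions):
    """Roll out a one-step transition function autoregressively.

    Each predicted state is fed back in as the input for the next step,
    so any one-step error is carried forward along the trajectory.
    """
    states = [s0]
    for a in actions:
        states.append(step_fn(states[-1], a))
    return np.stack(states)


# Hypothetical toy environment: s' = A s + B a (position/velocity system).
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])

# Hypothetical "learned" model: the true dynamics plus a small parameter error.
A_hat = A_true + 0.01 * np.array([[0.0, 1.0], [1.0, 0.0]])


def true_step(s, a):
    return A_true @ s + B_true @ a


def model_step(s, a):
    return A_hat @ s + B_true @ a


horizon = 50
s0 = np.array([1.0, 0.0])
actions = np.full((horizon, 1), 0.5)  # constant action, chosen for determinism

real = rollout(true_step, s0, actions)
pred = rollout(model_step, s0, actions)

# Per-step Euclidean distance between the true and the model-generated state.
errors = np.linalg.norm(real - pred, axis=1)

for t in (1, 10, 25, 50):
    print(f"step {t:2d}: prediction error = {errors[t]:.4f}")
```

In this toy setting the error is zero at the start state and grows at every subsequent step, even though the one-step model is only slightly off. This is the kind of long-horizon degradation that the abstract attributes to earlier model-based offline RL methods.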