Offline reinforcement learning (RL) aims to optimize a policy from previously collected data, without online interaction. Model-based approaches are particularly appealing for offline RL because a learned model can generate additional data and so mitigate the limitations of the offline dataset. Prior research has shown that introducing conservatism into the model or the Q-function during policy optimization can effectively alleviate the distribution drift problem that is prevalent in offline RL. However, the impact of conservatism in reward estimation has not yet been investigated in depth. This paper proposes a novel model-based offline RL algorithm, Conservative Reward for model-based Offline Policy optimization (CROP), which estimates the reward conservatively during model training. To obtain a conservative reward estimate, CROP simultaneously minimizes the estimation error and the reward of random actions. Theoretical analysis shows that this conservative reward mechanism leads to conservative policy evaluation and helps mitigate distribution drift. Experiments on the D4RL benchmarks show that CROP performs comparably to state-of-the-art baselines. Notably, CROP establishes a connection between offline and online RL: offline RL problems can be tackled by applying online RL techniques to the empirical Markov decision process trained with a conservative reward. The source code is available at https://github.com/G0K0URURI/CROP.git.
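The abstract describes the reward objective only in words, so the following is a minimal sketch of one way such a loss could look, not the authors' implementation. It assumes the two terms are combined as a weighted sum with a hypothetical coefficient `beta`, that random actions are drawn uniformly from assumed box bounds `action_low` and `action_high`, and that `reward_fn` is any callable reward model; none of these names come from the paper.

```python
import random


def conservative_reward_loss(reward_fn, batch, beta=0.1,
                             action_low=-1.0, action_high=1.0, rng=None):
    """Estimation error on dataset transitions plus a penalty on random actions.

    batch: list of (state, action, reward) tuples, where state and action
    are lists of floats. beta and the action bounds are illustrative
    assumptions, not values taken from the paper.
    """
    rng = rng or random.Random()
    n = len(batch)

    # Term 1: mean squared error between predicted and observed rewards.
    mse = sum((reward_fn(s, a) - r) ** 2 for s, a, r in batch) / n

    # Term 2: mean predicted reward for actions sampled uniformly at random,
    # evaluated at the dataset states. Minimizing it pushes down the reward
    # of actions that the dataset does not support.
    penalty = 0.0
    for s, a, _reward in batch:
        random_action = [rng.uniform(action_low, action_high) for _dim in a]
        penalty += reward_fn(s, random_action)
    penalty /= n

    return mse + beta * penalty
```

Setting `beta=0` reduces the sketch to ordinary reward regression; larger values trade estimation accuracy on the dataset for lower predicted reward on random actions.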