Designing and deriving effective model-based reinforcement learning (MBRL) algorithms with a performance improvement guarantee is challenging, mainly because model learning and policy optimization are tightly coupled. Many prior methods that rely on return discrepancy to guide model learning ignore the impact of model shift, which can lead to performance deterioration due to excessive model updates. Other methods use a performance difference bound to explicitly account for model shift. However, these methods rely on a fixed threshold to constrain model shift, which makes them heavily dependent on the threshold and unable to adapt during training. In this paper, we theoretically derive an optimization objective that unifies model shift and model bias, and then formulate a fine-tuning process. This process adaptively adjusts the model updates to obtain a performance improvement guarantee while avoiding model overfitting. Building on these results, we develop a straightforward algorithm, USB-PO (Unified model Shift and model Bias Policy Optimization). Empirical results show that USB-PO achieves state-of-the-art performance on several challenging benchmark tasks.