In recent years, trust-region on-policy reinforcement learning has achieved impressive results on complex control tasks and games. However, state-of-the-art algorithms in this category primarily improve expected performance and lack the ability to control worst-case performance. To address this limitation, we introduce a novel objective function whose optimization guarantees monotonic improvement in the lower probability bound of performance with high confidence. Building on this theoretical result, we further introduce a practical solution called Absolute Policy Optimization (APO). Our experiments demonstrate the effectiveness of our approach on challenging continuous control benchmark tasks and extend its applicability to Atari games. Our findings show that APO and its efficient variant, Proximal Absolute Policy Optimization (PAPO), significantly outperform state-of-the-art policy gradient algorithms, yielding substantial improvements in both worst-case and expected performance.