In recent years, trust-region on-policy reinforcement learning has achieved impressive results on complex control tasks and in gaming scenarios. However, state-of-the-art algorithms in this category primarily emphasize improving expected performance and lack the ability to control worst-case performance. To address this limitation, we introduce a novel objective function; optimizing it guarantees monotonic improvement, with high confidence, in the lower probability bound of performance. Building on this theoretical advancement, we further introduce a practical solution called Absolute Policy Optimization (APO). Our experiments demonstrate the effectiveness of our approach on challenging continuous control benchmark tasks and extend its applicability to Atari games. Our findings reveal that APO, as well as its efficient variant Proximal Absolute Policy Optimization (PAPO), significantly outperforms state-of-the-art policy gradient algorithms, yielding substantial improvements in both worst-case and expected performance.