In recent years, trust region on-policy reinforcement learning has achieved impressive results on complex control tasks and in gaming scenarios. However, state-of-the-art algorithms in this category primarily emphasize improving expected performance and lack the ability to control worst-case performance. To address this limitation, we introduce a novel objective function whose optimization guarantees monotonic improvement in the lower bound of near-total performance samples (absolute performance). Building on this theoretical result, we apply a series of approximations to the theoretically grounded algorithm, yielding a practical solution called Absolute Policy Optimization (APO). Our experiments demonstrate the effectiveness of our approach on challenging continuous control benchmark tasks and extend its applicability to Atari games. Our findings show that APO significantly outperforms state-of-the-art policy gradient algorithms, with substantial improvements in both expected performance and worst-case performance.