In recent years, trust region on-policy reinforcement learning has achieved impressive results on complex control tasks and games. However, state-of-the-art algorithms in this category primarily focus on improving expected performance and cannot control worst-case performance. To address this limitation, we introduce a novel objective function whose optimization guarantees monotonic improvement in the lower bound of near-total performance samples (absolute performance). Building on this theoretical result, we refine the theoretically grounded algorithm through a series of approximations, resulting in a practical solution called Absolute Policy Optimization (APO). Our experiments demonstrate the effectiveness of our approach on challenging continuous control benchmark tasks and extend its applicability to Atari games. Our findings show that APO significantly outperforms state-of-the-art policy gradient algorithms, yielding substantial improvements in both expected and worst-case performance.