In recent years, trust region on-policy reinforcement learning has achieved impressive results on complex control tasks and games. However, state-of-the-art algorithms in this category primarily emphasize improving expected performance and lack the ability to control worst-case performance. To address this limitation, we introduce a novel objective function whose optimization guarantees monotonic improvement in the lower bound of near-total performance samples (absolute performance). Building on this theoretical advancement, we then refine the theoretically grounded algorithm through a series of approximations, resulting in a practical solution called Absolute Policy Optimization (APO). Our experiments demonstrate the effectiveness of our approach on challenging continuous control benchmark tasks and extend its applicability to Atari games. Our findings reveal that APO significantly outperforms state-of-the-art policy gradient algorithms, yielding substantial improvements in both expected performance and worst-case performance.