Among the great successes of Reinforcement Learning (RL), self-play algorithms play an essential role in solving competitive games. Current self-play algorithms optimize the agent to maximize its expected win rate against current or historical copies of itself, which often leaves the agent stuck in a local optimum with a simple, homogeneous strategy style. A possible solution is to improve the diversity of policies, which helps the agent break the stalemate and enhances its robustness against different opponents. However, enhancing diversity in self-play algorithms is not trivial. In this paper, we introduce diversity from the perspective that agents can have diverse risk preferences in the face of uncertainty. Specifically, we design a novel reinforcement learning algorithm called Risk-sensitive Proximal Policy Optimization (RPPO), which smoothly interpolates between worst-case and best-case policy learning and allows policies to be learned with desired risk preferences. We seamlessly integrate RPPO with population-based self-play, so that agents in the population optimize dynamic risk-sensitive objectives using experience gathered from playing against diverse opponents. Empirical results show that our method achieves comparable or superior performance in competitive games and that diverse modes of behavior emerge. Our code is publicly available at \url{https://github.com/Jackory/RPBT}.
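The abstract does not say how the interpolation between worst-case and best-case learning is realized. As an illustration only, the sketch below uses the expectile, a standard risk measure with this property: a single level `tau` in (0, 1) moves the statistic from near the minimum of the outcomes, through the mean at `tau = 0.5`, to near the maximum. The function name `expectile`, the parameter `tau`, and the sample returns are assumptions made for this example and are not taken from RPPO or its code.

```python
def expectile(samples, tau, iters=100, tol=1e-12):
    """Return the tau-expectile of `samples`, with 0 < tau < 1.

    Illustrative helper (not from the paper): outcomes above the current
    estimate get weight tau, outcomes at or below it get weight 1 - tau,
    and the estimate is the resulting weighted mean, iterated to a fixed point.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    if not samples:
        raise ValueError("samples must be non-empty")

    estimate = sum(samples) / len(samples)  # start from the plain mean
    for _ in range(iters):
        weights = [tau if x > estimate else 1.0 - tau for x in samples]
        updated = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
        if abs(updated - estimate) < tol:
            break
        estimate = updated
    return estimate


# Hypothetical episode returns for one agent (e.g. loss, draw, big win).
returns = [-1.0, 0.0, 2.0]

risk_averse = expectile(returns, 0.01)   # pulled toward the worst outcome
risk_neutral = expectile(returns, 0.5)   # equals the ordinary mean
risk_seeking = expectile(returns, 0.99)  # pulled toward the best outcome
```

Under this assumption, giving each agent in a population its own `tau` would make each one value the same experience differently, which is one way to read the "diverse risk preferences" described above.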