Among the great successes of reinforcement learning (RL), self-play algorithms have played an essential role in solving competitive games. Current self-play algorithms optimize the agent to maximize its expected win rate against current or historical copies of itself, which often leaves it stuck in a local optimum with a simple, homogeneous strategy style. A possible solution is to improve the diversity of policies, which helps the agent break the stalemate and enhances its robustness when facing different opponents. However, enhancing diversity in self-play algorithms is not trivial. In this paper, we introduce diversity from the perspective that agents can have diverse risk preferences in the face of uncertainty. Specifically, we design a novel reinforcement learning algorithm called Risk-sensitive Proximal Policy Optimization (RPPO), which smoothly interpolates between worst-case and best-case policy learning and allows for policy learning with desired risk preferences. By seamlessly integrating RPPO with population-based self-play, we let agents in the population optimize dynamic risk-sensitive objectives using experience gathered from playing against diverse opponents. Empirical results show that our method achieves comparable or superior performance in competitive games and that diverse modes of behavior emerge. Our code is publicly available at \url{https://github.com/Jackory/RPBT}.
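The abstract does not say how RPPO interpolates between worst-case and best-case learning. As a minimal illustration of the general idea only, and not the authors' implementation, the sketch below uses an expectile of sampled returns. The function name `expectile`, the risk parameter `tau`, and the sample returns are all hypothetical choices made for this example. Under this assumption, `tau` near 0 weights bad outcomes heavily (risk-averse, close to the worst case), `tau = 0.5` recovers the ordinary mean, and `tau` near 1 weights good outcomes heavily (risk-seeking, close to the best case).

```python
def expectile(samples, tau, iters=200):
    """Return the tau-expectile of `samples` (illustrative helper, not from the paper).

    The expectile e solves  tau * sum((x - e)+) = (1 - tau) * sum((e - x)+),
    which is found here by iteratively reweighted averaging.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    estimate = sum(samples) / len(samples)  # tau = 0.5 stays at the mean
    for _ in range(iters):
        # Outcomes above the current estimate get weight tau, the rest 1 - tau.
        weights = [tau if x > estimate else 1.0 - tau for x in samples]
        estimate = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return estimate


# Hypothetical episode returns: one loss, two draws, two wins.
returns = [-1.0, 0.0, 0.0, 1.0, 1.0]

risk_averse = expectile(returns, 0.01)   # pulled toward the worst outcome
risk_neutral = expectile(returns, 0.5)   # equals the sample mean
risk_seeking = expectile(returns, 0.99)  # pulled toward the best outcome

for name, value in [("risk-averse", risk_averse),
                    ("risk-neutral", risk_neutral),
                    ("risk-seeking", risk_seeking)]:
    print(f"{name}: {value:.3f}")
```

In a population-based setting, one could imagine assigning each agent its own `tau`, so that agents evaluating the same experience form different value estimates. How RPPO actually defines and adapts its risk-sensitive objectives is described in the paper and the linked repository, not in this sketch.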