The combination of self-play and planning has achieved great success in sequential games such as Chess and Go. However, adapting algorithms such as AlphaZero to simultaneous games poses a new challenge. In these games, missing information about the concurrent actions of other agents is a limiting factor, because those agents may select different Nash equilibria or may not play optimally at all. Thus, it is vital to model the behavior of the other agents when interacting with them in simultaneous games. To this end, we propose Albatross: AlphaZero for Learning Bounded-rational Agents and Temperature-based Response Optimization using Simulated Self-play. Albatross learns to play the novel equilibrium concept of a Smooth Best Response Logit Equilibrium (SBRLE), which enables cooperation and competition with agents of any playing strength. We perform an extensive evaluation of Albatross on a set of cooperative and competitive simultaneous perfect-information games. In contrast to AlphaZero, Albatross is able to exploit weak agents in the competitive game of Battlesnake. Additionally, it yields an improvement of 37.6% over the previous state of the art in the cooperative Overcooked benchmark.