The combination of self-play and planning has achieved great success in sequential games, for instance in Chess and Go. However, adapting algorithms such as AlphaZero to simultaneous games poses a new challenge. In these games, missing information about the concurrent actions of other agents is a limiting factor, as they may select different Nash equilibria or not play optimally at all. Thus, it is vital to model the behavior of the other agents when interacting with them in simultaneous games. To this end, we propose Albatross: AlphaZero for Learning Bounded-rational Agents and Temperature-based Response Optimization using Simulated Self-play. Albatross learns to play the novel equilibrium concept of a Smooth Best Response Logit Equilibrium (SBRLE), which enables cooperation and competition with agents of any playing strength. We perform an extensive evaluation of Albatross on a set of cooperative and competitive simultaneous perfect-information games. In contrast to AlphaZero, Albatross is able to exploit weak agents in the competitive game of Battlesnake. Additionally, it yields an improvement of 37.6% over the previous state of the art in the cooperative Overcooked benchmark.
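To make the "temperature-based response" idea concrete: a smooth best response in the logit (quantal-response) sense replaces the argmax of a best response with a softmax over expected payoffs, where a temperature parameter interpolates between fully rational play and uniform random play. The sketch below is a minimal illustration of this standard concept, not the Albatross implementation itself; the payoff matrix and opponent policy are hypothetical examples.

```python
import numpy as np

def smooth_best_response(payoff_matrix, opponent_policy, temperature):
    """Logit (softmax) response to an opponent's mixed strategy.

    payoff_matrix[i, j]: payoff for playing action i against opponent action j.
    As temperature -> 0 this approaches a pure best response; as
    temperature -> infinity it approaches uniform random play, modeling
    a bounded-rational agent.
    """
    expected = payoff_matrix @ opponent_policy   # expected value of each own action
    logits = expected / temperature
    logits -= logits.max()                       # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Matching-pennies-style payoffs against a biased opponent (illustrative values)
payoffs = np.array([[1.0, -1.0],
                    [-1.0, 1.0]])
opponent = np.array([0.7, 0.3])

sharp = smooth_best_response(payoffs, opponent, temperature=0.1)   # near-best response
soft = smooth_best_response(payoffs, opponent, temperature=10.0)   # near-uniform play
```

At low temperature the response concentrates almost all mass on the exploiting action; at high temperature it stays close to uniform, which is why a single temperature parameter can summarize an opponent's playing strength.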