Self-play (SP) is a popular multi-agent reinforcement learning (MARL) framework for solving competitive games, in which each agent optimizes its policy by treating the other agents as part of the environment. Despite its empirical success, theoretical guarantees for SP-based methods are limited to two-player zero-sum games. In mixed cooperative-competitive games, where agents on the same team must cooperate with each other, we give a simple counter-example in which SP-based methods fail with high probability to converge to a global Nash equilibrium (NE). Alternatively, Policy-Space Response Oracles (PSRO) is an iterative framework for learning NE, in which best responses to the previous policies are learned in each iteration. PSRO extends directly to mixed cooperative-competitive settings by jointly learning team best responses, with all of its convergence properties unchanged. However, PSRO repeatedly trains joint policies from scratch until convergence, which makes it hard to scale to complex games. In this work, we develop a novel algorithm, Fictitious Cross-Play (FXP), which inherits the benefits of both frameworks. FXP simultaneously trains an SP-based main policy and a counter population of best-response policies. The main policy is trained by fictitious self-play and by cross-play against the counter population, while the counter policies are trained as best responses to past versions of the main policy. We validate our method in matrix games and show that FXP converges to global NEs where SP methods fail. We also conduct experiments in a gridworld domain, where FXP achieves higher Elo ratings and lower exploitability than the baselines, and in a more challenging football game, where FXP defeats state-of-the-art (SOTA) models with a win rate of over 94%.
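To make the terms "fictitious self-play" and "exploitability" concrete, the following is a minimal sketch of classical fictitious play run in self-play on a two-player zero-sum matrix game. It is not an implementation of FXP and has no counter population. The rock-paper-scissors payoff matrix, the function names, the initial action, and the iteration count are all illustrative assumptions rather than details taken from the paper.

```python
# Row player's payoff in rock-paper-scissors (symmetric, zero-sum).
# This game is an illustrative assumption, not one of the paper's matrix games.
PAYOFF = [
    [0, -1, 1],   # rock     vs rock, paper, scissors
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]


def action_values(payoff, opponent_mix):
    """Expected payoff of each pure action against a mixed opponent strategy."""
    return [sum(p * q for p, q in zip(row, opponent_mix)) for row in payoff]


def exploitability(payoff, mix):
    """Best-response payoff against `mix`.

    In a symmetric zero-sum game the game value is 0, so this quantity
    is 0 exactly when `mix` is a Nash equilibrium strategy.
    """
    return max(action_values(payoff, mix))


def fictitious_self_play(payoff, iterations):
    """Repeatedly best-respond to the empirical average of past actions."""
    counts = [0] * len(payoff)
    counts[0] = 1  # arbitrary initial action (assumption)
    for _ in range(iterations):
        total = sum(counts)
        average = [c / total for c in counts]
        values = action_values(payoff, average)
        best = values.index(max(values))  # ties broken by lowest index
        counts[best] += 1
    total = sum(counts)
    return [c / total for c in counts]


average_policy = fictitious_self_play(PAYOFF, 20000)
print("average policy:", average_policy)
print("exploitability:", exploitability(PAYOFF, average_policy))
```

In this two-player zero-sum setting the average policy approaches the uniform equilibrium and its exploitability shrinks toward zero. This is the regime in which, as the abstract notes, SP-based methods have theoretical guarantees. The mixed cooperative-competitive case that motivates FXP is not covered by this sketch.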