Self-play is a technique for machine learning in multi-agent systems in which a learning algorithm learns by interacting with copies of itself. Self-play is useful for generating large quantities of training data, but it has a drawback: the agents the learner faces after training may behave dramatically differently from what the learner came to expect by interacting with itself. For the special case of two-player constant-sum games, self-play that reaches a Nash equilibrium is guaranteed to produce strategies that perform well against any post-training opponent; however, no such guarantee exists for multiplayer games. We show that in games that approximately decompose into a set of two-player constant-sum games (called constant-sum polymatrix games) where global $\epsilon$-Nash equilibria are boundedly far from Nash equilibria in each subgame (called subgame stability), any no-external-regret algorithm that learns by self-play will produce a strategy with bounded vulnerability. For the first time, our results identify a structural property of multiplayer games that enables performance guarantees for the strategies produced by a broad class of self-play algorithms. We demonstrate our findings through experiments on Leduc poker.
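To make the two-player constant-sum baseline concrete, the following is a minimal sketch of self-play with a no-external-regret learner. It is not the paper's experimental setup: it assumes regret matching as the learner, a small hypothetical zero-sum matrix game (a weighted rock-paper-scissors with game value 0), and it measures vulnerability as the game value minus the worst-case payoff of the averaged strategy, which is one common convention rather than necessarily the paper's definition. All function names are illustrative.

```python
# Sketch: regret-matching self-play in a two-player zero-sum matrix game.
# Assumptions (not from the source): the payoff matrix, the choice of regret
# matching as the no-external-regret learner, and the vulnerability measure.

# Row player's payoffs; the column player receives the negation.
# The matrix is skew-symmetric, so the game value is 0.
PAYOFF = [
    [0.0, -1.0, 2.0],
    [1.0, 0.0, -1.0],
    [-2.0, 1.0, 0.0],
]


def regret_matching_strategy(regrets):
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total <= 0.0:
        return [1.0 / len(regrets)] * len(regrets)
    return [p / total for p in positive]


def self_play(payoff, iterations):
    """Run both players with regret matching; return their average strategies."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_regret, col_regret = [0.0] * n_rows, [0.0] * n_cols
    row_sum, col_sum = [0.0] * n_rows, [0.0] * n_cols
    for _ in range(iterations):
        x = regret_matching_strategy(row_regret)
        y = regret_matching_strategy(col_regret)
        # Expected payoff of each pure action against the opponent's current mix.
        row_values = [sum(payoff[i][j] * y[j] for j in range(n_cols))
                      for i in range(n_rows)]
        col_values = [-sum(payoff[i][j] * x[i] for i in range(n_rows))
                      for j in range(n_cols)]
        row_ev = sum(x[i] * row_values[i] for i in range(n_rows))
        col_ev = sum(y[j] * col_values[j] for j in range(n_cols))
        # External regret: gain from having played a fixed action instead.
        for i in range(n_rows):
            row_regret[i] += row_values[i] - row_ev
            row_sum[i] += x[i]
        for j in range(n_cols):
            col_regret[j] += col_values[j] - col_ev
            col_sum[j] += y[j]
    return ([s / iterations for s in row_sum],
            [s / iterations for s in col_sum])


def vulnerability(payoff, row_strategy, game_value=0.0):
    """Game value minus the row strategy's payoff against a best response."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    worst_case = min(
        sum(row_strategy[i] * payoff[i][j] for i in range(n_rows))
        for j in range(n_cols)
    )
    return game_value - worst_case


avg_row, avg_col = self_play(PAYOFF, iterations=10_000)
print("average row strategy:", avg_row)
print("vulnerability:", vulnerability(PAYOFF, avg_row))
```

In this two-player setting the averaged strategy's vulnerability shrinks as the number of iterations grows, because the sum of the two players' average regrets bounds how much a best-responding opponent can gain. The abstract's point is that this argument does not carry over to multiplayer games in general, and that it can be recovered, with bounded loss, when the game is approximately a subgame-stable constant-sum polymatrix game.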