Self-play is a technique for machine learning in multi-agent systems in which a learning algorithm learns by interacting with copies of itself. Self-play is useful for generating large quantities of training data, but it has a drawback: the agents the learner faces after training may behave dramatically differently from what the learner came to expect by interacting with itself. For the special case of two-player constant-sum games, self-play that reaches a Nash equilibrium is guaranteed to produce strategies that perform well against any post-training opponent; however, no such guarantee exists for multiplayer games. We show that in games that approximately decompose into a set of two-player constant-sum games (called constant-sum polymatrix games), and in which global $\epsilon$-Nash equilibria are boundedly far from Nash equilibria in each subgame (called subgame stability), any no-external-regret algorithm that learns by self-play will produce a strategy with bounded vulnerability. For the first time, our results identify a structural property of multiplayer games that enables performance guarantees for the strategies produced by a broad class of self-play algorithms. We demonstrate our findings through experiments on Leduc poker.
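To make the setting concrete, the sketch below runs one no-external-regret algorithm in self-play on a toy constant-sum polymatrix game and measures how much the resulting strategy can lose to adversarial opponents. Everything specific here is an assumption chosen for illustration rather than taken from the paper: the game (three players, every pair playing rock-paper-scissors), the choice of Hedge as the learner, the step size `eta`, the asymmetric starting point, and the working definition of vulnerability as the gap between a player's self-play payoff and its payoff when the other players switch to their most harmful replies.

```python
import math

# Hypothetical toy game (an assumption, not from the paper): three players,
# and every pair of players plays rock-paper-scissors against each other.
# RPS[a][b] is the payoff to the player choosing a against an opponent choosing b.
# RPS is antisymmetric, so every pairwise game is zero-sum.
RPS = [[0, 1, -1],
       [-1, 0, 1],
       [1, -1, 0]]
N_PLAYERS, N_ACTIONS = 3, 3


def action_values(i, profile):
    """Expected payoff of each pure action of player i against the others' mixed strategies."""
    return [sum(RPS[a][b] * profile[j][b]
                for j in range(N_PLAYERS) if j != i
                for b in range(N_ACTIONS))
            for a in range(N_ACTIONS)]


def hedge_self_play(rounds=2000, eta=0.05):
    """Run Hedge (multiplicative weights) for all players at once; return average strategies."""
    # Asymmetric starting scores (arbitrary) so that play does not begin at the uniform profile.
    cum = [[0.0, 1.0, 2.0], [2.0, 0.0, 1.0], [0.0, 0.0, 3.0]]
    avg = [[0.0] * N_ACTIONS for _ in range(N_PLAYERS)]
    for _ in range(rounds):
        profile = []
        for i in range(N_PLAYERS):
            top = max(cum[i])  # subtract the max for numerical stability
            weights = [math.exp(eta * (c - top)) for c in cum[i]]
            total = sum(weights)
            profile.append([w / total for w in weights])
        for i in range(N_PLAYERS):
            values = action_values(i, profile)
            for a in range(N_ACTIONS):
                cum[i][a] += values[a]              # accumulate payoffs for the next update
                avg[i][a] += profile[i][a] / rounds  # running average of the strategy
    return avg


def vulnerability(i, profile):
    """Payoff player i loses if the others move from `profile` to their worst-case replies."""
    x = profile[i]
    expected = sum(p * v for p, v in zip(x, action_values(i, profile)))
    # In a polymatrix game each opponent can minimise player i's payoff independently;
    # here all pairwise games are identical, so every opponent picks the same reply.
    per_reply = [sum(x[a] * RPS[a][b] for a in range(N_ACTIONS)) for b in range(N_ACTIONS)]
    worst = (N_PLAYERS - 1) * min(per_reply)
    return expected - worst


if __name__ == "__main__":
    average = hedge_self_play()
    print("average strategy of player 0:", average[0])
    print("vulnerability of player 0:", vulnerability(0, average))
```

In this toy game the pairwise decomposition is exact, so one would expect the vulnerability of the averaged strategy to be modest for a small step size and many rounds. The sketch does not verify subgame stability, and it is not a substitute for the Leduc poker experiments.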