Self-play is a technique for machine learning in multi-agent systems in which a learning algorithm learns by interacting with copies of itself. Self-play is useful for generating large quantities of training data, but it has a drawback: the agents the learner faces after training may behave very differently from what the learner came to expect by interacting with itself. For the special case of two-player constant-sum games, self-play that reaches a Nash equilibrium is guaranteed to produce strategies that perform well against any post-training opponent; however, no such guarantee exists for multi-player games. We show that in games that approximately decompose into a set of two-player constant-sum games (called polymatrix games), where global $\epsilon$-Nash equilibria are boundedly far from Nash equilibria in each subgame, any no-external-regret algorithm that learns by self-play will produce a strategy with bounded vulnerability. Our results identify, for the first time, a structural property of multi-player games that enables performance guarantees for the strategies produced by a broad class of self-play algorithms. We demonstrate our findings through experiments on Leduc poker.
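To make the two-player constant-sum case concrete, the following is a minimal sketch of self-play with a no-external-regret learner. It is not the paper's method or experimental setup: regret matching is used here only as one well-known no-external-regret algorithm, the game is a hypothetical biased rock-paper-scissors matrix chosen for illustration, and `exploitability` is the usual two-player stand-in for how much a strategy can lose to a worst-case opponent, not the multi-player vulnerability studied in the paper. All function and variable names are assumptions.

```python
# Hypothetical biased rock-paper-scissors: payoff to the row player.
# The column player receives the negation, so the game is zero-sum.
PAYOFF = [
    [0, -1, 2],
    [1, 0, -1],
    [-2, 1, 0],
]


def regret_matching(regrets):
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    # No positive regret yet: fall back to the uniform strategy.
    return [1.0 / len(regrets)] * len(regrets)


def self_play(payoff, iterations):
    """Run two copies of the same learner against each other.

    Returns the time-averaged strategies of the row and column players.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_regret, col_regret = [0.0] * n_rows, [0.0] * n_cols
    row_sum, col_sum = [0.0] * n_rows, [0.0] * n_cols

    for _ in range(iterations):
        x = regret_matching(row_regret)
        y = regret_matching(col_regret)

        # Expected payoff of each pure action against the opponent's current mix.
        row_values = [sum(payoff[i][j] * y[j] for j in range(n_cols))
                      for i in range(n_rows)]
        col_values = [-sum(payoff[i][j] * x[i] for i in range(n_rows))
                      for j in range(n_cols)]
        row_expected = sum(x[i] * row_values[i] for i in range(n_rows))
        col_expected = sum(y[j] * col_values[j] for j in range(n_cols))

        # External regret: how much better each fixed action would have done.
        for i in range(n_rows):
            row_regret[i] += row_values[i] - row_expected
            row_sum[i] += x[i]
        for j in range(n_cols):
            col_regret[j] += col_values[j] - col_expected
            col_sum[j] += y[j]

    return ([s / iterations for s in row_sum],
            [s / iterations for s in col_sum])


def exploitability(payoff, x, y):
    """Total gain available to best-responding opponents of x and y."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    best_row = max(sum(payoff[i][j] * y[j] for j in range(n_cols))
                   for i in range(n_rows))
    best_col = max(-sum(payoff[i][j] * x[i] for i in range(n_rows))
                   for j in range(n_cols))
    return best_row + best_col


if __name__ == "__main__":
    avg_x, avg_y = self_play(PAYOFF, 10000)
    print("average row strategy:", avg_x)
    print("average column strategy:", avg_y)
    print("exploitability:", exploitability(PAYOFF, avg_x, avg_y))
```

In this two-player setting the exploitability of the averaged strategies equals the sum of the two players' average regrets, so it shrinks as the learners' regret grows sublinearly. The multi-player result described above concerns when a comparable bound survives once the game only approximately decomposes into such pairwise subgames.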