This work addresses equilibrium selection in no-conflict multi-agent games, specifically the problem of selecting a Pareto-optimal equilibrium among several existing equilibria. It has been shown that many state-of-the-art multi-agent reinforcement learning (MARL) algorithms are prone to converging to Pareto-dominated equilibria because of the uncertainty each agent has about the policies of the other agents during training. To address sub-optimal equilibrium selection, we propose Pareto Actor-Critic (Pareto-AC), an actor-critic algorithm that utilises a simple property of no-conflict games (a superset of cooperative games): the Pareto-optimal equilibrium in a no-conflict game maximises the returns of all agents and is therefore the preferred outcome for all agents. We evaluate Pareto-AC in a diverse set of multi-agent games and show that it converges to higher episodic returns than seven state-of-the-art MARL algorithms and that it converges to a Pareto-optimal equilibrium in a range of matrix games. Finally, we propose PACDCG, a graph neural network extension of Pareto-AC, which we show scales efficiently to games with a large number of agents.
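To make the equilibrium selection problem concrete, the following minimal sketch enumerates the pure Nash equilibria of a small no-conflict matrix game and picks out the Pareto-optimal one. The stag-hunt payoffs and all function names are illustrative assumptions, not taken from the paper, and the code is not the Pareto-AC algorithm itself. It only shows the property the abstract relies on (one equilibrium maximises the returns of every agent) and why uncertainty about the other agent can favour the dominated equilibrium.

```python
from itertools import product

# Hypothetical two-agent stag-hunt payoffs (illustrative, not from the paper).
# PAYOFFS[(a0, a1)] = (return of agent 0, return of agent 1).
# Action 0 = "stag", action 1 = "hare".
PAYOFFS = {
    (0, 0): (4, 4),
    (0, 1): (0, 3),
    (1, 0): (3, 0),
    (1, 1): (2, 2),
}
N_ACTIONS = 2


def is_pure_nash(joint):
    """True if no agent can gain by unilaterally changing its action."""
    for agent in range(2):
        for alt in range(N_ACTIONS):
            deviation = list(joint)
            deviation[agent] = alt
            if PAYOFFS[tuple(deviation)][agent] > PAYOFFS[joint][agent]:
                return False
    return True


def pareto_dominates(u, v):
    """True if u is at least as good as v for every agent and strictly better for one."""
    return all(x >= y for x, y in zip(u, v)) and any(x > y for x, y in zip(u, v))


def pure_nash_equilibria():
    """All joint actions that are pure-strategy Nash equilibria."""
    return [j for j in product(range(N_ACTIONS), repeat=2) if is_pure_nash(j)]


def pareto_optimal(equilibria):
    """Equilibria that are not Pareto-dominated by another equilibrium in the list."""
    return [
        e for e in equilibria
        if not any(pareto_dominates(PAYOFFS[o], PAYOFFS[e]) for o in equilibria)
    ]


def expected_return_vs_uniform(action):
    """Agent 0's expected return when agent 1 acts uniformly at random."""
    return sum(PAYOFFS[(action, b)][0] for b in range(N_ACTIONS)) / N_ACTIONS


equilibria = pure_nash_equilibria()
print(equilibria)                  # [(0, 0), (1, 1)]
print(pareto_optimal(equilibria))  # [(0, 0)]
print([expected_return_vs_uniform(a) for a in range(N_ACTIONS)])  # [2.0, 2.5]
```

In this toy game both (stag, stag) and (hare, hare) are equilibria, but only (stag, stag) is Pareto-optimal. Against an opponent whose policy is unknown and modelled here as uniform, "hare" has the higher expected return (2.5 versus 2.0), which illustrates how an agent can be pulled towards the Pareto-dominated equilibrium.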