Multi-agent learning algorithms have been shown to display complex, unstable behaviours in a wide array of games. In fact, previous works indicate that convergent behaviours become less likely as the total number of agents increases. This seemingly prohibits convergence to stable strategies, such as Nash Equilibria (NE), in games with many players. To make progress on this challenge, we study the Q-Learning Dynamics, a classical model of exploration and exploitation in multi-agent learning. In particular, we study the behaviour of Q-Learning in games where interactions between agents are constrained by a network. We determine a number of sufficient conditions, depending on the game and network structure, which guarantee that agent strategies converge to a unique stable strategy, called the Quantal Response Equilibrium (QRE). Crucially, these sufficient conditions are independent of the total number of agents, allowing for provable convergence in arbitrarily large games. Next, we compare the learned QRE to the underlying NE of the game by showing that any QRE is an $\epsilon$-approximate Nash Equilibrium. We provide tight bounds on $\epsilon$ and show how these bounds lead naturally to a centralised scheme for choosing exploration rates, which enables independent learners to learn stable approximate Nash Equilibrium strategies. We validate the method through experiments and demonstrate its effectiveness even in the presence of numerous agents and actions. Together, these results show that independent learning dynamics may converge to approximate Nash Equilibria, even in the presence of many agents.
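As an illustration of the objects named in the abstract, the following is a minimal sketch of Q-Learning Dynamics on a small network game. It is not the authors' implementation. It assumes the commonly used continuous-time form $\dot{x}_{ki} = x_{ki}\,[\,r_{ki} - \langle x_k, r_k\rangle - T(\ln x_{ki} - \sum_j x_{kj}\ln x_{kj})\,]$, integrated with a plain Euler step, on a ring network with randomly drawn pairwise payoff matrices. The exploration rate `T`, the ring topology, the payoff range and all function names are assumptions chosen for illustration. The sketch reports how far the final strategies are from the softmax (QRE) condition and the largest gain $\epsilon$ any agent could obtain by deviating unilaterally.

```python
import numpy as np


def softmax(v):
    """Numerically stable softmax of a 1-D array."""
    e = np.exp(v - v.max())
    return e / e.sum()


def payoffs(x, A, neighbours):
    """r[k, i]: expected payoff to agent k for action i against its neighbours."""
    return np.array([sum(A[(k, l)] @ x[l] for l in neighbours[k])
                     for k in range(len(x))])


def q_learning_step(x, A, neighbours, T, dt):
    """One Euler step of the (assumed) Q-Learning Dynamics with exploration rate T."""
    r = payoffs(x, A, neighbours)
    avg = np.sum(x * r, axis=1, keepdims=True)            # <x_k, r_k>
    log_x = np.log(x)
    explore = log_x - np.sum(x * log_x, axis=1, keepdims=True)
    x = x + dt * x * (r - avg - T * explore)
    # The update preserves row sums analytically; renormalise against drift.
    return x / x.sum(axis=1, keepdims=True)


def run_demo(n_agents=5, n_actions=2, T=3.0, dt=0.01, steps=5000, seed=0):
    rng = np.random.default_rng(seed)
    # Hypothetical ring network: each agent interacts with two neighbours.
    neighbours = {k: [(k - 1) % n_agents, (k + 1) % n_agents]
                  for k in range(n_agents)}
    # Hypothetical payoffs: A[(k, l)] is agent k's payoff matrix against l.
    A = {(k, l): rng.uniform(-1.0, 1.0, (n_actions, n_actions))
         for k in neighbours for l in neighbours[k]}
    # Interior starting strategies, one row per agent.
    x = rng.uniform(0.2, 0.8, (n_agents, n_actions))
    x = x / x.sum(axis=1, keepdims=True)

    for _ in range(steps):
        x = q_learning_step(x, A, neighbours, T, dt)

    r = payoffs(x, A, neighbours)
    # QRE condition: each strategy is the softmax of its own payoffs over T.
    residual = max(np.abs(x[k] - softmax(r[k] / T)).max()
                   for k in range(n_agents))
    # epsilon: largest gain any agent gets from a unilateral best response.
    eps = max(r[k].max() - x[k] @ r[k] for k in range(n_agents))
    return x, residual, eps


if __name__ == "__main__":
    x, residual, eps = run_demo()
    print("final strategies:\n", x)
    print("distance from QRE condition:", residual)
    print("epsilon of the learned strategy:", eps)
```

In this sketch the exploration rate is deliberately large relative to the payoffs, so the dynamics settle quickly; lowering `T` trades stability for a smaller $\epsilon$. For softmax responses a standard entropy argument gives $\epsilon \le T \ln n$ for $n$ actions, which is a generic bound and not necessarily the tight bound referred to above.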