Multi-agent learning in competitive settings is usually studied under the restrictive assumption that the game is zero-sum. Only under this assumption is the behaviour of learning well understood; outside it, learning dynamics often fail to converge, which rules out fixed-point analysis. Yet many competitive games of practical interest are not zero-sum. Motivated by this, we study a smooth variant of Q-Learning, a popular reinforcement learning dynamic that balances the agents' tendency to maximise their payoffs against their propensity to explore the state space. We examine this dynamic in games that are 'close' to network zero-sum games and find that Q-Learning converges to a neighbourhood of a unique equilibrium. The size of this neighbourhood is determined by the 'distance' to the zero-sum game and by the agents' exploration rates. We complement these results with a method that, given an arbitrary network game, efficiently finds the 'nearest' network zero-sum game. As our experiments show, these guarantees hold regardless of whether the dynamics ultimately reach an equilibrium or remain non-convergent.
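To make the trade-off between payoff maximisation and exploration concrete, the following is a minimal sketch of one common discrete-time form of smooth Q-Learning, and not necessarily the exact formulation analysed in this work. It assumes a two-player normal-form game (the simplest network game, with a single edge), Q-values updated towards expected payoffs, and a Boltzmann (softmax) policy whose temperature plays the role of the exploration rate. The names `softmax`, `smooth_q_learning`, `temperature` and `step` are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(q, temperature):
    """Boltzmann policy: a higher temperature means more exploration."""
    z = (q - q.max()) / temperature  # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def smooth_q_learning(A, B, temperature=1.0, step=0.1, iters=2000,
                      q1=None, q2=None):
    """Iterate expected-payoff Q updates for a two-player game.

    A[i, j] is the payoff to player 1 and B[i, j] the payoff to player 2
    when player 1 plays action i and player 2 plays action j.
    Returns the final mixed strategies (x, y).
    """
    q1 = np.zeros(A.shape[0]) if q1 is None else np.asarray(q1, dtype=float)
    q2 = np.zeros(A.shape[1]) if q2 is None else np.asarray(q2, dtype=float)
    for _ in range(iters):
        x = softmax(q1, temperature)
        y = softmax(q2, temperature)
        # Move each Q-value towards the expected payoff of that action
        # against the opponent's current mixed strategy.
        q1 = (1 - step) * q1 + step * (A @ y)
        q2 = (1 - step) * q2 + step * (B.T @ x)
    return softmax(q1, temperature), softmax(q2, temperature)


# Matching Pennies (exactly zero-sum) and a slightly perturbed variant.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
perturbation = 0.1 * np.array([[1.0, 0.0], [0.0, -1.0]])  # illustrative only
x_zs, y_zs = smooth_q_learning(A, -A, q1=[1.0, 0.0], q2=[0.0, 0.5])
x_pt, y_pt = smooth_q_learning(A, -A + perturbation, q1=[1.0, 0.0], q2=[0.0, 0.5])
```

In the exactly zero-sum case the iterates settle at the uniform strategy for both players, which is the unique fixed point of these updates at this temperature. The perturbed run lets one inspect how far the strategies move as the game departs from zero-sum; the size and shape of the perturbation here are arbitrary.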