Many real-world applications involve agents divided into two teams, with payoffs that are equal within a team but opposite in sign across the two teams. Such two-team zero-sum Markov games (2t0sMGs) have been addressed with reinforcement learning in recent years. However, existing methods remain inefficient because they give insufficient consideration to intra-team credit assignment, data utilization, and computational intractability. In this paper, we propose the individual-global-minimax (IGMM) principle to ensure coherence between the two-team minimax behaviors and the individual greedy behaviors through Q functions in 2t0sMGs. Based on this principle, we present a novel multi-agent reinforcement learning framework, Factorized Multi-Agent MiniMax Q-Learning (FM3Q), which factorizes the joint minimax Q function into individual ones and iteratively solves for the IGMM-satisfying minimax Q functions of 2t0sMGs. Moreover, we propose an online learning algorithm with neural networks to implement FM3Q and obtain deterministic, decentralized minimax policies for the players of both teams. A theoretical analysis proves the convergence of FM3Q. Empirically, we evaluate the learning efficiency and final performance of FM3Q in three environments and show its superiority on 2t0sMGs.
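To make the IGMM idea concrete, the following is a minimal sketch, not the FM3Q algorithm itself. It assumes a purely additive factorization in which the joint Q value for a single state is the sum of one team's individual utilities minus the sum of the other team's; the abstract does not specify the factorization actually used, so this form, the names `team_a_q`, `team_b_q`, `joint_q`, and the random values are all illustrative assumptions. Under that assumption, the sketch checks by brute force that each agent acting greedily on its own Q values reproduces the joint max-min solution.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3

# Hypothetical individual Q values for one fixed state, two agents per team.
# Team A maximizes the joint value; team B minimizes it.
team_a_q = [rng.normal(size=n_actions) for _ in range(2)]
team_b_q = [rng.normal(size=n_actions) for _ in range(2)]


def joint_q(a_actions, b_actions):
    """Assumed additive factorization: team A utilities minus team B utilities."""
    return (sum(q[a] for q, a in zip(team_a_q, a_actions))
            - sum(q[b] for q, b in zip(team_b_q, b_actions)))


def individual_greedy():
    """Each agent picks the action that maximizes its own individual Q."""
    a = tuple(int(np.argmax(q)) for q in team_a_q)
    b = tuple(int(np.argmax(q)) for q in team_b_q)
    return a, b


def brute_force_minimax():
    """Enumerate joint actions: team A maximizes against team B's best response."""
    joint_actions = list(itertools.product(range(n_actions), repeat=2))
    best_a, best_b, best_value = None, None, -np.inf
    for a in joint_actions:
        # Team B's best response is the joint action that minimizes the joint Q.
        b_resp = min(joint_actions, key=lambda b: joint_q(a, b))
        value = joint_q(a, b_resp)
        if value > best_value:
            best_a, best_b, best_value = a, b_resp, value
    return best_a, best_b, best_value


if __name__ == "__main__":
    greedy_a, greedy_b = individual_greedy()
    mm_a, mm_b, mm_value = brute_force_minimax()
    print("individual greedy:", greedy_a, greedy_b)
    print("joint minimax:    ", mm_a, mm_b, mm_value)
```

The brute-force search enumerates every pair of joint actions, so its cost grows exponentially with the number of agents, whereas the individual greedy choice needs only one argmax per agent. That gap is the kind of computational saving a factorization consistent with IGMM is meant to provide.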