Centralized training is widely used in multi-agent reinforcement learning (MARL) to ensure the stability of the training process. Once a joint policy is obtained, a value function factorization method is needed to extract optimal decentralized policies for the agents, and this factorization must satisfy the individual-global-max (IGM) principle. While imposing additional restrictions on the IGM function class can help to meet this requirement, doing so limits the method's applicability to more complex multi-agent environments. In this paper, we propose QFree, a universal value function factorization method for MARL. We start by deriving mathematically equivalent conditions for the IGM principle based on the advantage function, which ensure that the principle holds without any compromise and remove the conservatism of conventional methods. We then establish a more expressive mixing network architecture that can fulfill the equivalent factorization. In particular, a novel loss function is developed by using the equivalent conditions as a regularization term during policy evaluation in the MARL algorithm. Finally, the effectiveness of the proposed method is verified in a nonmonotonic matrix game scenario. Moreover, we show that QFree achieves state-of-the-art performance in a general-purpose complex MARL benchmark environment, the StarCraft Multi-Agent Challenge (SMAC).
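As an illustration of what the IGM principle requires, the following minimal sketch checks whether the greedy joint action of a joint value table coincides with the tuple of each agent's individually greedy action. It is not QFree itself and does not implement the advantage-based conditions or the mixing network; the function name `satisfies_igm`, the payoff matrix, and the per-agent utility vectors are all hypothetical values chosen for illustration, and unique maximizers are assumed.

```python
import numpy as np

def satisfies_igm(q_tot, q_individual):
    """Check IGM for one state: the joint greedy action must equal the
    tuple of per-agent greedy actions (assumes unique maximizers)."""
    # Greedy joint action under the joint value table.
    joint_greedy = np.unravel_index(np.argmax(q_tot), q_tot.shape)
    joint_greedy = tuple(int(a) for a in joint_greedy)
    # Greedy action of each agent under its own utility vector.
    individual_greedy = tuple(int(np.argmax(q)) for q in q_individual)
    return joint_greedy == individual_greedy

# Hypothetical nonmonotonic two-agent payoff table (rows: agent 1, columns: agent 2).
# The best joint action (0, 0) is surrounded by heavily penalized miscoordination.
q_tot = np.array([[  8.0, -12.0, -12.0],
                  [-12.0,   0.0,   0.0],
                  [-12.0,   0.0,   0.0]])

# Hypothetical per-agent utilities whose greedy actions agree with q_tot.
q_consistent = [np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])]

# Hypothetical per-agent utilities that prefer the "safe" action 1 instead,
# as can happen when a restricted factorization cannot represent q_tot.
q_inconsistent = [np.array([-1.0, 0.5, 0.0]), np.array([-1.0, 0.5, 0.0])]

igm_consistent = satisfies_igm(q_tot, q_consistent)
igm_inconsistent = satisfies_igm(q_tot, q_inconsistent)
print(igm_consistent, igm_inconsistent)
```

In this toy setting the first set of utilities satisfies IGM and the second does not, which is the kind of failure a nonmonotonic matrix game is meant to expose.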