Multi-agent credit assignment is a fundamental challenge in cooperative multi-agent reinforcement learning (MARL), where a team of agents learns from a shared reward signal. The Individual-Global-Max (IGM) condition is a widely used principle for multi-agent credit assignment, requiring that the joint action determined by the individual Q-functions maximizes the global Q-value. Meanwhile, the principle of maximum entropy has been leveraged to enhance exploration in MARL. However, we identify a critical limitation of existing maximum entropy MARL methods: a misalignment arises between the local policies and the joint policy that maximizes the global Q-value, leading to violations of the IGM condition. To address this misalignment, we propose an order-preserving transformation. Building on this transformation, we introduce ME-IGM, a novel maximum entropy MARL algorithm that is compatible with any credit assignment mechanism, satisfies the IGM condition, and enjoys the benefits of maximum entropy exploration. We empirically evaluate two variants of ME-IGM, ME-QMIX and ME-QPLEX, on non-monotonic matrix games, and demonstrate their state-of-the-art performance across 17 scenarios in SMAC-v2 and Overcooked.