Value function factorization methods are widely used in cooperative multi-agent reinforcement learning, with QMIX receiving significant attention. Many QMIX-based methods introduce monotonicity constraints between the joint action value and individual action values to achieve decentralized execution. However, such constraints limit the representation capacity of value factorization, restricting the joint action values it can represent and hindering the learning of the optimal policy. To address this challenge, we propose the Potentially Optimal joint actions Weighted QMIX (POWQMIX) algorithm, which identifies potentially optimal joint actions and assigns higher weights to the losses associated with these joint actions during training. We theoretically prove that, with this weighted training approach, the optimal policy is guaranteed to be recovered. Experiments in matrix games, predator-prey, and StarCraft II Multi-Agent Challenge environments demonstrate that our algorithm outperforms state-of-the-art value-based multi-agent reinforcement learning methods.
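To make the weighting scheme concrete, the following is a minimal PyTorch-style sketch of a weighted TD loss of the kind the abstract describes: transitions whose joint actions are flagged as potentially optimal receive a larger loss weight. All names here (weighted_td_loss, potentially_optimal, alpha) are illustrative assumptions, not the authors' implementation; in particular, how the potentially optimal joint actions are recognized is the core of POWQMIX and is not reproduced here.

    import torch

    def weighted_td_loss(q_tot, targets, potentially_optimal, alpha=2.0):
        # q_tot:               (batch,) joint action values from the monotonic mixing network
        # targets:             (batch,) TD targets (treated as fixed, hence detach)
        # potentially_optimal: (batch,) bool mask marking joint actions judged
        #                      potentially optimal by some recognizer (hypothetical here)
        # alpha > 1 up-weights the loss on potentially optimal joint actions
        weights = torch.where(potentially_optimal,
                              torch.full_like(q_tot, alpha),
                              torch.ones_like(q_tot))
        return (weights * (q_tot - targets.detach()) ** 2).mean()

Under this sketch, setting alpha = 1 recovers the ordinary (unweighted) QMIX TD loss, while alpha > 1 biases training toward fitting the joint action values of the flagged actions more accurately.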