Value decomposition (VD) methods have achieved remarkable success in cooperative multi-agent reinforcement learning (MARL). However, their reliance on the max operator for temporal-difference (TD) target calculation leads to systematic Q-value overestimation. This issue is particularly severe in MARL due to the combinatorial explosion of the joint action space, which often results in unstable learning and suboptimal policies. To address this problem, we propose QSIM, a similarity-weighted Q-learning framework that reconstructs the TD target using action similarity. Instead of using the greedy joint action directly, QSIM forms a similarity-weighted expectation over a structured near-greedy joint action space. This formulation allows the target to integrate Q-values from diverse yet behaviorally related actions while assigning greater influence to those that are more similar to the greedy choice. By smoothing the target with structurally relevant alternatives, QSIM effectively mitigates overestimation and improves learning stability. Extensive experiments demonstrate that QSIM can be seamlessly integrated with various VD methods, consistently yielding superior performance and stability compared to the original algorithms. Furthermore, empirical analysis confirms that QSIM significantly mitigates the systematic value overestimation in MARL. Code is available at https://github.com/MaoMaoLYJ/pymarl-qsim.
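The core idea can be sketched in a toy tabular setting. The sketch below is illustrative only: the near-greedy set (joint actions deviating from the greedy one in a single agent's component) and the softmax-over-Hamming-distance weighting are assumed constructions chosen for clarity, not necessarily the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_actions = 3, 4

# Toy joint Q-values for the next state, shape (n_actions,) * n_agents.
q_next = rng.normal(size=(n_actions,) * n_agents)

# Greedy joint action: the usual max-operator choice.
greedy = np.unravel_index(np.argmax(q_next), q_next.shape)

# Structured near-greedy joint action space (assumed structure):
# the greedy action plus all joint actions differing in exactly one agent.
candidates = [greedy]
for i in range(n_agents):
    for a in range(n_actions):
        if a != greedy[i]:
            alt = list(greedy)
            alt[i] = a
            candidates.append(tuple(alt))

# Similarity weights: softmax over negative Hamming distance to the greedy
# action, so actions closer to the greedy choice get more influence.
tau = 0.5
dists = np.array([sum(x != y for x, y in zip(c, greedy)) for c in candidates])
weights = np.exp(-dists / tau)
weights /= weights.sum()

# Similarity-weighted expectation replaces max_a Q(s', a) in the TD target.
target_q = float(sum(w * q_next[c] for w, c in zip(weights, candidates)))

# A convex combination of Q-values can never exceed the greedy maximum,
# which is why the smoothed target curbs overestimation.
assert target_q <= q_next[greedy]
```

Because the weights form a convex combination, the smoothed target is always bounded above by the greedy max, which is the mechanism by which the overestimation bias is reduced.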