Real-world cooperation often requires many agents to coordinate intensively at the same time. This task has been extensively studied within the framework of cooperative multi-agent reinforcement learning (MARL), and value decomposition methods are among the leading solutions. However, traditional methods that learn the value function as a monotonic mixing of per-agent utilities cannot solve tasks with non-monotonic returns, which hinders their application in general scenarios. Recent methods tackle this problem from the perspective of implicit credit assignment, either by learning value functions with complete expressiveness or by using additional structures to improve cooperation. However, the former are difficult to learn because of large joint action spaces, and the latter are insufficient to capture the complicated interactions among agents that are essential to solving tasks with non-monotonic returns. To address these problems, we propose a novel explicit credit assignment method. Our method, Adaptive Value decomposition with Greedy Marginal contribution (AVGM), is based on an adaptive value decomposition that learns the cooperative value of a group of dynamically changing agents. We first illustrate that the proposed value decomposition can account for the complicated interactions among agents and remains feasible to learn in large-scale scenarios. Then, our method uses a greedy marginal contribution computed from the value decomposition as an individual credit to incentivize agents to learn the optimal cooperative policy. We further extend the module with an action encoder to guarantee linear time complexity for computing the greedy marginal contribution. Experimental results demonstrate that our method achieves significant performance improvements in several non-monotonic domains.
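To make the idea of a greedy marginal contribution concrete, the following is a minimal sketch on a toy two-agent matrix game. It is not the AVGM implementation: the payoff numbers, the zero baseline `SOLO_VALUE`, and the function names are all hypothetical, and the exact definition used by the method is assumed rather than taken from the abstract. The sketch only shows why crediting an action by its best-case contribution can recover the optimal joint action where an averaged credit does not.

```python
# Toy non-monotonic game (hypothetical numbers): rows are agent 1's actions,
# columns are the partner's actions. The optimum (0, 0) is surrounded by penalties.
PAYOFF = [
    [8, -12, -12],
    [-12, 0, 0],
    [-12, 0, 0],
]

# Assumed value of the partner acting without agent 1 (hypothetical baseline).
SOLO_VALUE = [0, 0, 0]


def marginal_contributions(action):
    """Marginal contribution of agent 1's action for each partner action."""
    return [PAYOFF[action][b] - SOLO_VALUE[b] for b in range(len(SOLO_VALUE))]


def average_marginal(action):
    """Credit averaged over a uniformly exploring partner."""
    contributions = marginal_contributions(action)
    return sum(contributions) / len(contributions)


def greedy_marginal(action):
    """Credit assuming the partner picks the action that cooperates best."""
    return max(marginal_contributions(action))


def best_action(credit_fn):
    """Action of agent 1 that maximizes the given credit (first one on ties)."""
    return max(range(len(PAYOFF)), key=credit_fn)


if __name__ == "__main__":
    print("averaged credit picks action", best_action(average_marginal))
    print("greedy credit picks action", best_action(greedy_marginal))
```

In this toy game the averaged credit steers agent 1 away from action 0 because the penalties dominate the mean, while the greedy credit ranks action 0 highest because it evaluates each action under the most cooperative partner response. Note that this sketch enumerates the partner's actions directly; the abstract states that the method relies on an action encoder to keep this computation linear, which is not modeled here.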