Cooperative multi-agent systems can be naturally used to model many real-world problems, such as network packet routing and the coordination of autonomous vehicles. There is a great need for new reinforcement learning methods that can efficiently learn decentralised policies for such systems. To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agents' policies. In addition, to address the challenges of multi-agent credit assignment, it uses a counterfactual baseline that marginalises out a single agent's action, while keeping the other agents' actions fixed. COMA also uses a critic representation that allows the counterfactual baseline to be computed efficiently in a single forward pass. We evaluate COMA in the testbed of StarCraft unit micromanagement, using a decentralised variant with significant partial observability. COMA significantly improves average performance over other multi-agent actor-critic methods in this setting, and the best performing agents are competitive with state-of-the-art centralised controllers that get access to the full state.
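The counterfactual baseline described above can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the function name, array shapes, and toy numbers are assumptions. Because the centralised critic outputs Q-values for all of one agent's actions in a single forward pass (with the other agents' actions held fixed), the baseline is just an expectation of those Q-values under the agent's own policy.

```python
import numpy as np

def counterfactual_advantage(q_values, policy, chosen_action):
    """Counterfactual advantage for a single agent (illustrative sketch).

    q_values: shape (n_actions,), the critic's Q(s, (u^-a, u'^a)) for each
              alternative action u'^a of this agent, other agents' actions fixed.
    policy:   shape (n_actions,), this agent's current action probabilities.
    chosen_action: index of the action the agent actually took.

    Returns Q of the joint action taken, minus the counterfactual baseline
    that marginalises out this agent's action under its own policy.
    """
    baseline = np.dot(policy, q_values)  # E_{u'^a ~ pi^a}[Q(s, (u^-a, u'^a))]
    return q_values[chosen_action] - baseline

# Toy example: 3 actions, uniform policy (hypothetical numbers).
q = np.array([1.0, 2.0, 3.0])
pi = np.array([1 / 3, 1 / 3, 1 / 3])
adv = counterfactual_advantage(q, pi, chosen_action=2)
# baseline = 2.0, so the advantage of action 2 is 1.0
```

Note that the baseline does not depend on the agent's chosen action, so subtracting it leaves the policy gradient unbiased while reducing variance, which is the standard justification for action-independent baselines.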