Many recent successful off-policy multi-agent reinforcement learning (MARL) algorithms for cooperative partially observable environments focus on finding factorized value functions, which leads to convoluted network structures. Building on the structure of independent Q-learners, our LAN algorithm takes a radically different approach: it leverages a dueling architecture to learn, for each agent, a decentralized best-response policy via an individual advantage function. Learning is stabilized by a centralized critic whose primary objective is to reduce the moving target problem of the individual advantages. The critic, whose network size is independent of the number of agents, is discarded after training. Evaluation on the StarCraft II multi-agent challenge benchmark shows that LAN reaches state-of-the-art performance and is highly scalable with respect to the number of agents, opening up a promising alternative direction for MARL research.
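The dueling idea described in the abstract can be illustrated with a minimal sketch. It assumes the standard dueling decomposition, in which each agent's Q-values are the sum of a shared value estimate and that agent's own advantages; the abstract does not state the exact combination rule LAN uses, so this form is an assumption. The function names and numeric values are hypothetical and chosen only for illustration. The sketch shows why a critic of this kind can be dropped at execution time: adding the same value to every action leaves each agent's greedy action unchanged.

```python
def local_q_values(central_value, advantages):
    """Assumed dueling-style combination: Q_i(a) = V + A_i(a) for each agent i.

    central_value: scalar estimate from a (hypothetical) centralized critic.
    advantages: list of per-agent lists, one advantage per action.
    """
    return [[central_value + a for a in row] for row in advantages]


def argmax(values):
    """Index of the largest entry (first one wins on ties)."""
    return max(range(len(values)), key=lambda k: values[k])


def greedy_actions(advantages):
    """Decentralized execution: each agent only looks at its own advantages."""
    return [argmax(row) for row in advantages]


# Made-up numbers: 2 agents, 3 actions each.
advantages = [[0.1, -0.3, 0.5],
              [0.0, 0.2, -0.1]]
central_value = 1.5  # made-up centralized value estimate

q_values = local_q_values(central_value, advantages)

# The greedy action under Q matches the greedy action under the advantages
# alone, so the value estimate is not needed to act in this sketch.
assert [argmax(row) for row in q_values] == greedy_actions(advantages)
```

In this sketch the value estimate only matters during training, where it would enter the learning targets; how LAN actually constructs those targets is not specified in the abstract and is not modeled here.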