In this paper, we propose a new mutual information framework for multi-agent reinforcement learning (MARL) that enables multiple agents to learn coordinated behaviors by regularizing the accumulated return with the mutual information between the agents' simultaneous actions. By introducing a latent variable that induces nonzero mutual information between the agents' actions and applying a variational bound, we derive a tractable lower bound on the resulting maximum mutual information (MMI) regularized objective function. The derived objective can be interpreted as maximum entropy reinforcement learning combined with a reduction of the uncertainty about other agents' actions. Applying policy iteration to maximize the derived lower bound, we propose a practical algorithm named variational maximum mutual information multi-agent actor-critic (VM3-AC), which follows centralized learning with decentralized execution. We evaluate VM3-AC on several games requiring coordination, and numerical results show that VM3-AC outperforms other MARL algorithms on multi-agent tasks requiring high-quality coordination.
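The two ingredients named in the abstract, a shared latent variable that makes the agents' actions dependent and a variational lower bound on their mutual information, can be illustrated on a toy discrete problem. The sketch below is not the VM3-AC algorithm. It assumes two agents with binary actions, a binary latent variable z, and tabular policies, and every function and variable name is hypothetical. It uses the standard bound I(a1; a2) >= H(a1) + E[log q(a1 | a2)], which holds for any conditional distribution q and is tight when q equals the true posterior p(a1 | a2).

```python
import numpy as np


def joint_from_latent(p_z, pi1, pi2):
    """Joint p(a1, a2) = sum_z p(z) * pi1(a1 | z) * pi2(a2 | z).

    pi1 and pi2 are indexed as pi[z, a]; the result is indexed as joint[a1, a2].
    """
    return np.einsum("z,zi,zj->ij", p_z, pi1, pi2)


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability entries."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


def mutual_information(joint):
    """I(a1; a2) = H(a1) + H(a2) - H(a1, a2)."""
    p1 = joint.sum(axis=1)
    p2 = joint.sum(axis=0)
    return entropy(p1) + entropy(p2) - entropy(joint.ravel())


def exact_posterior(joint):
    """True posterior, indexed as q[a2, a1] = p(a1 | a2)."""
    p2 = joint.sum(axis=0)
    return joint.T / p2[:, None]


def variational_lower_bound(joint, q):
    """H(a1) + E_{p(a1, a2)}[log q(a1 | a2)], with q indexed as q[a2, a1].

    q must be strictly positive so that the logarithm is finite.
    """
    p1 = joint.sum(axis=1)
    expected_log_q = float((joint * np.log(q.T)).sum())
    return entropy(p1) + expected_log_q


if __name__ == "__main__":
    p_z = np.array([0.5, 0.5])
    # Both agents condition on the same latent z (illustrative numbers).
    pi = np.array([[0.9, 0.1],
                   [0.1, 0.9]])
    joint = joint_from_latent(p_z, pi, pi)

    print("MI with shared latent:", mutual_information(joint))
    print("bound, exact q:       ",
          variational_lower_bound(joint, exact_posterior(joint)))
    print("bound, uniform q:     ",
          variational_lower_bound(joint, np.full((2, 2), 0.5)))
```

In this toy setting, policies that ignore z produce a product-form joint distribution and zero mutual information. That is why a latent variable (or some other shared source of correlation) is needed before a mutual information regularizer has anything to act on.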