Executing actions in a correlated manner is a common strategy for human coordination that often leads to better cooperation, and it is potentially beneficial for cooperative multi-agent reinforcement learning (MARL) as well. However, the recent success of MARL relies heavily on the convenient paradigm of purely decentralized execution, in which agents' actions are uncorrelated for the sake of scalability. In this work, we introduce a Bayesian network to establish correlations between agents' action selections in their joint policy. We provide a theoretical justification for why action dependencies are beneficial by deriving the multi-agent policy gradient formula under such a Bayesian network joint policy and proving its global convergence to Nash equilibria under tabular softmax policy parameterization in cooperative Markov games. Further, by equipping existing MARL algorithms with a recent method for differentiable directed acyclic graphs (DAGs), we develop practical algorithms that learn context-aware Bayesian network policies in scenarios with partial observability and varying difficulty. We also dynamically increase the sparsity of the learned DAG throughout training, which leads to weakly correlated or even purely independent policies for decentralized execution. Empirical results on a range of MARL benchmarks show the benefits of our approach.
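To make the notion of a Bayesian network joint policy concrete, the following is a minimal illustrative sketch, not the implementation used in this work. It assumes the standard Bayesian network factorization, in which each agent's action distribution is conditioned on the state and on the actions of its parent agents in a DAG, and it samples a joint action in topological order. The names `topological_order`, `sample_joint_action`, and `toy_logits`, as well as the adjacency convention `adj[j, i] = 1` (agent `j` is a parent of agent `i`), are hypothetical and chosen only for illustration.

```python
import numpy as np


def topological_order(adj):
    """Return agent indices so that every parent precedes its children.

    Assumed convention: adj[j, i] == 1 means agent j is a parent of agent i.
    """
    n = adj.shape[0]
    indeg = adj.sum(axis=0).astype(int)
    order = [i for i in range(n) if indeg[i] == 0]
    for j in order:  # the list grows while we iterate over it
        for i in range(n):
            if adj[j, i]:
                indeg[i] -= 1
                if indeg[i] == 0:
                    order.append(i)
    if len(order) != n:
        raise ValueError("adjacency matrix contains a cycle")
    return order


def softmax(logits):
    z = np.asarray(logits, dtype=float)
    z = np.exp(z - z.max())
    return z / z.sum()


def sample_joint_action(adj, logits_fn, state, rng):
    """Sample a joint action agent by agent, conditioning on parent actions.

    logits_fn(agent, state, parent_actions) is a hypothetical per-agent
    softmax policy that returns unnormalized action logits.
    """
    n = adj.shape[0]
    actions = [None] * n
    log_prob = 0.0
    for i in topological_order(adj):
        parents = np.flatnonzero(adj[:, i])
        parent_actions = tuple(actions[int(j)] for j in parents)
        probs = softmax(logits_fn(i, state, parent_actions))
        a = int(rng.choice(len(probs), p=probs))
        actions[i] = a
        log_prob += float(np.log(probs[a]))
    return actions, log_prob


def toy_logits(agent, state, parent_actions):
    """Toy policy: root agents act uniformly, children copy their first parent."""
    if not parent_actions:
        return np.zeros(2)
    logits = np.full(2, -20.0)
    logits[parent_actions[0]] = 20.0
    return logits


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    correlated = np.array([[0, 1], [0, 0]])   # agent 0 -> agent 1
    independent = np.zeros((2, 2), dtype=int)  # empty DAG: decentralized
    print(sample_joint_action(correlated, toy_logits, None, rng))
    print(sample_joint_action(independent, toy_logits, None, rng))
```

With the edge from agent 0 to agent 1, the toy child policy matches its parent's action, so the two actions are correlated. With the empty DAG, each agent samples independently, which corresponds to the purely decentralized execution discussed above.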