The empirical success of multi-agent reinforcement learning (MARL) has motivated the search for more efficient and scalable algorithms for large-scale multi-agent systems. However, existing state-of-the-art algorithms do not fully exploit inter-agent coupling information. In this paper, we propose a systematic approach that leverages the structure of inter-agent couplings for efficient model-free reinforcement learning. We model the cooperative MARL problem via a Bayesian network and characterize, for each agent, the subset of agents, termed the value dependency set, whose information is required to estimate its local action-value function exactly. Based on the value dependency set, we propose a partially decentralized training with decentralized execution (P-DTDE) paradigm. We theoretically establish that the total variance of our P-DTDE policy gradient estimator is less than that of the centralized training with decentralized execution (CTDE) policy gradient estimator. We derive a multi-agent policy gradient theorem under the P-DTDE scheme and develop a scalable actor-critic algorithm. We demonstrate the efficiency and scalability of the proposed algorithm on multi-warehouse resource allocation and multi-zone temperature control examples. For dense value dependency sets, we propose an approximation scheme based on truncation of the Bayesian network and empirically show that it converges faster than the exact value dependency set on applications with a large number of agents.
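To make the contrast concrete, the following is a minimal sketch, not the paper's exact theorem: it contrasts a generic CTDE policy gradient estimator with a P-DTDE-style estimator in which agent $i$'s critic conditions only on its value dependency set. The notation is assumed for illustration: $\mathcal{D}_i$ denotes agent $i$'s value dependency set, $o_i$ its local observation, and $s_{\mathcal{D}_i}, a_{\mathcal{D}_i}$ the joint state and action restricted to $\mathcal{D}_i$.

$$
\text{CTDE:}\quad \nabla_{\theta_i} J(\theta) \;=\; \mathbb{E}\!\left[\nabla_{\theta_i}\log\pi_{\theta_i}(a_i \mid o_i)\, Q(s,a)\right],
\qquad
\text{P-DTDE:}\quad \nabla_{\theta_i} J(\theta) \;=\; \mathbb{E}\!\left[\nabla_{\theta_i}\log\pi_{\theta_i}(a_i \mid o_i)\, Q_i\!\left(s_{\mathcal{D}_i}, a_{\mathcal{D}_i}\right)\right].
$$

Under this reading, the P-DTDE critic $Q_i$ averages over the states and actions of the dependency set rather than the full joint state-action, which is the intuition behind the total-variance comparison stated in the abstract.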