In cooperative multi-agent reinforcement learning (MARL), combining value decomposition with actor-critic enables agents to learn stochastic policies, which are better suited to partially observable environments. Because the goal is to learn local policies that support decentralized execution, agents are commonly assumed to be independent of each other, even during centralized training. However, this assumption may prevent agents from learning the optimal joint policy. To address this problem, we explicitly incorporate the dependency among agents into centralized training. Although this leads to the optimal joint policy, that policy may not be factorizable for decentralized execution. Nevertheless, we theoretically show that from such a joint policy we can always derive another joint policy that achieves the same optimality and can be factorized for decentralized execution. To this end, we propose multi-agent conditional policy factorization (MACPF), which trains in a more centralized manner but still enables decentralized execution. We empirically verify MACPF in various cooperative MARL tasks and demonstrate that it achieves better performance or faster convergence than baselines. Our code is available at https://github.com/PKU-RL/FOP-DMAC-MACPF.
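To make the factorization argument concrete, the following toy sketch may help. It is not taken from the paper or its repository: the 2-agent matrix game, the payoff values, and all function names are assumptions chosen purely for illustration. It shows a dependent joint policy that is optimal but not a product of its marginals, the loss incurred by naively treating the agents as independent, and a different joint policy that is equally optimal and does factorize.

```python
from itertools import product

# Hypothetical 2-agent, 2-action cooperative matrix game (an assumption,
# not from the paper): the team is rewarded only when actions match.
PAYOFF = [[1.0, 0.0],
          [0.0, 1.0]]
ACTIONS = range(2)


def expected_return(joint):
    """Expected team reward under a joint policy joint[a1][a2]."""
    return sum(joint[a1][a2] * PAYOFF[a1][a2]
               for a1, a2 in product(ACTIONS, repeat=2))


def marginals(joint):
    """Per-agent marginal policies of a joint policy."""
    pi1 = [sum(joint[a1]) for a1 in ACTIONS]
    pi2 = [sum(joint[a1][a2] for a1 in ACTIONS) for a2 in ACTIONS]
    return pi1, pi2


def product_policy(pi1, pi2):
    """Joint policy induced by two independent local policies."""
    return [[pi1[a1] * pi2[a2] for a2 in ACTIONS] for a1 in ACTIONS]


def is_factorized(joint, tol=1e-9):
    """True if the joint policy equals the product of its marginals."""
    prod = product_policy(*marginals(joint))
    return all(abs(joint[a1][a2] - prod[a1][a2]) <= tol
               for a1, a2 in product(ACTIONS, repeat=2))


# Dependent joint policy: agent 1 picks uniformly, agent 2 copies agent 1.
dependent = [[0.5, 0.0],
             [0.0, 0.5]]

# Naive independence assumption: keep the marginals, drop the dependency.
independent = product_policy(*marginals(dependent))

# A different joint policy with the same return that does factorize:
# both agents deterministically pick action 0.
derived = product_policy([1.0, 0.0], [1.0, 0.0])

if __name__ == "__main__":
    for name, pol in [("dependent", dependent),
                      ("independent", independent),
                      ("derived", derived)]:
        print(name, expected_return(pol), is_factorized(pol))
```

In this toy game the derived policy is found by inspection. How MACPF obtains a factorized policy from the dependent one during training is described in the paper itself and is not reproduced here.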