Centralized training with decentralized execution (CTDE) is widely employed to stabilize partially observable multi-agent reinforcement learning (MARL) by utilizing a centralized value function during training. However, existing methods typically assume that agents make decisions based on their local observations independently, which may not lead to a correlated joint policy with sufficient coordination. Inspired by the concept of correlated equilibrium, we propose to introduce a \textit{strategy modification} to provide a mechanism for agents to correlate their policies. Specifically, we present a novel framework, AgentMixer, which constructs the joint fully observable policy as a non-linear combination of individual partially observable policies. To enable decentralized execution, one can derive individual policies by imitating the joint policy. Unfortunately, such imitation learning can lead to \textit{asymmetric learning failure} caused by the mismatch between joint policy and individual policy information. To mitigate this issue, we jointly train the joint policy and individual policies and introduce \textit{Individual-Global-Consistency} to guarantee mode consistency between the centralized and decentralized policies. We then theoretically prove that AgentMixer converges to an $\epsilon$-approximate Correlated Equilibrium. The strong experimental performance on three MARL benchmarks demonstrates the effectiveness of our method.
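For readers unfamiliar with the solution concept, the following sketches the standard definition of an $\epsilon$-approximate correlated equilibrium. It is stated for a one-shot game with $N$ agents, which is a simplifying assumption; the partially observable Markov setting the paper targets would replace the utilities with value functions, and the paper's own definition may differ. The symbols $A_i$ (action set of agent $i$), $u_i$ (utility of agent $i$), $\pi$ (joint distribution over joint actions $a = (a_i, a_{-i})$), and $\phi_i$ are notation introduced here for illustration and are not taken from the abstract.

\[
\mathbb{E}_{a \sim \pi}\bigl[u_i(a_i, a_{-i})\bigr] \;\ge\; \mathbb{E}_{a \sim \pi}\bigl[u_i(\phi_i(a_i), a_{-i})\bigr] - \epsilon
\qquad \text{for every agent } i \in \{1, \dots, N\} \text{ and every map } \phi_i : A_i \to A_i .
\]

Here $\phi_i$ plays the role of a strategy modification: agent $i$ replaces its recommended action $a_i$ with $\phi_i(a_i)$, and the inequality states that no such replacement improves the agent's expected utility by more than $\epsilon$. Setting $\epsilon = 0$ recovers the exact correlated equilibrium.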