In multi-agent reinforcement learning, centralized training with decentralized execution (CTDE) methods typically assume that agents make decisions independently based on their local observations, which may fail to produce a coordinated, correlated joint policy. Coordination can be encouraged explicitly during training, with individual policies trained to imitate the correlated joint policy. However, this may cause an \textit{asymmetric learning failure} due to the observation mismatch between the joint and individual policies. Inspired by the concept of correlated equilibrium, we introduce a \textit{strategy modification}, called AgentMixer, that allows agents to correlate their policies. AgentMixer non-linearly combines individual partially observable policies into a joint fully observable policy. To enable decentralized execution, we introduce \textit{Individual-Global-Consistency}, which guarantees mode consistency during joint training of the centralized and decentralized policies, and we prove that AgentMixer converges to an $\epsilon$-approximate Correlated Equilibrium. On the Multi-Agent MuJoCo, SMAC-v2, Matrix Game, and Predator-Prey benchmarks, AgentMixer matches or outperforms state-of-the-art methods.
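To make the mixing idea concrete, the following is a minimal sketch, not the paper's actual architecture: each agent's partially observable policy produces a local action distribution, and a hypothetical state-conditioned mixer (here a single `tanh` layer with made-up parameter names `W` and `M`) non-linearly modifies the product of the individual distributions into a correlated joint policy over action pairs.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a logit vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def individual_policy(obs, W):
    # Per-agent policy: action distribution from the local observation only.
    return softmax(W @ obs)

def agent_mixer(pis, state, M):
    # Hypothetical non-linear mixer: conditions on the full state to turn
    # the independent product policy into a correlated joint distribution.
    outer = np.outer(pis[0], pis[1])                # independent joint policy
    bias = np.tanh(M @ state).reshape(outer.shape)  # state-dependent correlation term
    joint = outer * np.exp(bias)                    # non-linear strategy modification
    return joint / joint.sum()                      # renormalize to a distribution

# Two agents, 4-dim local observations, 3 actions each (toy sizes).
rng = np.random.default_rng(0)
obs = [rng.normal(size=4), rng.normal(size=4)]
state = np.concatenate(obs)                         # full state = all observations
W = [rng.normal(size=(3, 4)) for _ in range(2)]     # per-agent policy weights
M = rng.normal(size=(9, 8))                         # mixer weights (3*3 joint actions)

pis = [individual_policy(o, w) for o, w in zip(obs, W)]
joint = agent_mixer(pis, state, M)
```

Because the mixer sees the full state while each individual policy sees only its own observation, the resulting joint distribution generally differs from the product of the individual policies, which is exactly the correlation that independent decentralized execution cannot express on its own.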