Cooperative multi-agent reinforcement learning (MARL) commonly adopts centralized training with decentralized execution (CTDE), where centralized critics leverage global information to guide decentralized actors. However, centralized-decentralized mismatch (CDM) arises when the suboptimal behavior of one agent degrades the learning of the others. Prior approaches mitigate CDM through value decomposition: linear decompositions permit per-agent gradients at the cost of limited expressiveness, while nonlinear decompositions improve representational capacity but require centralized gradients, which reintroduce CDM. To overcome this trade-off, we propose the multi-agent cross-entropy method (MCEM), combined with monotonic nonlinear critic decomposition (NCD). MCEM updates each policy by increasing the probability of high-value joint actions, thereby excluding suboptimal behaviors. To improve sample efficiency, we extend off-policy learning with a modified k-step return and Retrace. Analysis and experiments demonstrate that MCEM outperforms state-of-the-art methods on both continuous and discrete action benchmarks.
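As a concrete illustration of the update rule described above, the following is a minimal PyTorch sketch of a CEM-style policy step: sample candidate joint actions from the decentralized actors, score them with a centralized critic, keep the highest-value (elite) joint actions, and train each actor with a cross-entropy loss toward its own component of those elites. The names (AgentPolicy, mcem_update), the dimensions, and the critic(obs, joint_actions) interface are illustrative assumptions, not the paper's implementation; the actual MCEM algorithm, NCD critic, and off-policy corrections (modified k-step return, Retrace) are specified in the paper itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes for illustration only; not taken from the paper.
N_AGENTS, OBS_DIM, N_ACTIONS, N_SAMPLES, N_ELITES = 3, 16, 5, 64, 8

class AgentPolicy(nn.Module):
    """Decentralized actor: maps a local observation to action logits."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

    def forward(self, obs):
        return self.net(obs)

def mcem_update(policies, critic, obs, optimizer):
    """One CEM-style update: raise the likelihood of high-value joint actions.

    `critic(obs, joint_actions)` is assumed to return one scalar per sampled
    joint action (e.g., a monotonically decomposed Q_tot); its form is an
    assumption made for this sketch.
    """
    with torch.no_grad():
        # 1. Sample candidate joint actions from the current decentralized policies.
        logits = torch.stack([pi(obs[i]) for i, pi in enumerate(policies)])  # (N_AGENTS, N_ACTIONS)
        dist = torch.distributions.Categorical(logits=logits)
        joint_actions = dist.sample((N_SAMPLES,))                            # (N_SAMPLES, N_AGENTS)
        # 2. Score each sampled joint action with the centralized critic.
        values = critic(obs, joint_actions)                                  # (N_SAMPLES,)
        # 3. Keep the elite (highest-value) joint actions.
        elites = joint_actions[values.topk(N_ELITES).indices]               # (N_ELITES, N_AGENTS)

    # 4. Cross-entropy step: each agent increases the probability of its own
    #    component of the elite joint actions; no critic gradient reaches the actors.
    loss = 0.0
    for i, pi in enumerate(policies):
        agent_logits = pi(obs[i]).expand(N_ELITES, -1)
        loss = loss + F.cross_entropy(agent_logits, elites[:, i])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative usage with a dummy critic that assigns random scores.
policies = [AgentPolicy() for _ in range(N_AGENTS)]
optimizer = torch.optim.Adam([p for pi in policies for p in pi.parameters()], lr=1e-3)
obs = torch.randn(N_AGENTS, OBS_DIM)
dummy_critic = lambda o, a: torch.randn(a.shape[0])
mcem_update(policies, dummy_critic, obs, optimizer)
```

In this sketch, each actor only sees a supervised target built from the elite joint actions, so the critic's gradients never flow through the actors; this illustrates how a CEM-style update can pair an expressive nonlinear critic with per-agent policy updates.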