Online Multi-Agent Reinforcement Learning (MARL) is a prominent framework for efficient agent coordination, and enhancing policy expressiveness is pivotal for achieving superior performance. Diffusion-based generative models are well-positioned to meet this demand, having demonstrated remarkable expressiveness and multimodal representation capabilities in image generation and offline RL settings. Yet their potential in online MARL remains largely under-explored. A major obstacle is that the intractable likelihoods of diffusion models impede entropy-based exploration and coordination. To tackle this challenge, we propose \textbf{OMAD}, among the first \underline{O}nline off-policy \underline{MA}RL frameworks with \underline{D}iffusion policies, to orchestrate coordination. Our key innovation is a relaxed policy objective that maximizes a scaled joint entropy, facilitating effective exploration without relying on tractable likelihoods. Complementing this, within the centralized training with decentralized execution (CTDE) paradigm, we employ a joint distributional value function to optimize decentralized diffusion policies: it leverages tractable entropy-augmented targets to guide the simultaneous updates of the diffusion policies, thereby ensuring stable coordination. Extensive evaluations on MPE and MAMuJoCo establish our method as the new state-of-the-art across $10$ diverse tasks, demonstrating a $2.5\times$ to $5\times$ improvement in sample efficiency.
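As a minimal sketch of the target being referenced (the exact formulation is not given in this abstract; the temperature $\alpha$, joint action $\mathbf{a}$, joint policy $\boldsymbol{\pi}$, and joint value $Q_{\mathrm{jt}}$ are our illustrative assumptions), a soft, entropy-augmented joint target is conventionally written as
\[
y \;=\; r \;+\; \gamma\, \mathbb{E}_{\mathbf{a}' \sim \boldsymbol{\pi}(\cdot \mid s')}\!\left[\, Q_{\mathrm{jt}}(s', \mathbf{a}') \;-\; \alpha \log \boldsymbol{\pi}(\mathbf{a}' \mid s') \,\right],
\]
where $-\alpha \log \boldsymbol{\pi}(\mathbf{a}' \mid s')$ is the scaled joint-entropy bonus. Because $\log \boldsymbol{\pi}$ is intractable for diffusion policies, the relaxed objective described above would replace this term with a tractable surrogate.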