Efficient exploration is crucial in cooperative multi-agent reinforcement learning (MARL), especially in sparse-reward settings. However, because existing methods rely on unimodal policies, they are prone to falling into local optima, which hinders the effective exploration of better policies. Furthermore, tackling multi-agent tasks in complex environments requires cooperation during exploration, posing substantial challenges for MARL methods. To address these issues, we propose a Consistency Policy with consEnsus Guidance (CPEG), with two primary components: (a) introducing a multimodal policy to enhance exploration capabilities, and (b) sharing a consensus among agents to foster cooperation. For component (a), CPEG adopts a consistency model as the policy, leveraging its multimodal nature and stochastic characteristics to facilitate exploration. For component (b), CPEG introduces a Consensus Learner that infers a consensus on the global state from local observations. This consensus then serves as guidance for the Consistency Policy, promoting cooperation among agents. The proposed method is evaluated in multi-agent particle environments (MPE) and multi-agent MuJoCo (MAMuJoCo), and empirical results indicate that CPEG not only achieves improvements in sparse-reward settings but also matches the performance of baselines in dense-reward environments.
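To make the interaction between the two components concrete, the following is a minimal illustrative sketch of the inference path, not the authors' implementation: a hypothetical Consensus Learner maps a local observation to a consensus embedding, and a one-step consistency-style policy maps a random noise seed to an action conditioned on that embedding. All function names, the linear maps, and the dimensions are stand-ins for the learned networks described in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def consensus_learner(obs, W):
    # Hypothetical stand-in for the learned Consensus Learner:
    # infer a consensus embedding of the global state from a
    # local observation via a linear map + nonlinearity.
    return np.tanh(W @ obs)

def consistency_policy(obs, consensus, theta, sigma=1.0):
    # One-step consistency-style sampling: draw a noisy seed and map
    # it to an action in a single evaluation, conditioned on the local
    # observation and the consensus guidance. The stochastic seed is
    # what allows the policy to remain multimodal.
    z = rng.normal(scale=sigma, size=theta.shape[0])
    cond = np.concatenate([obs, consensus, z])
    return np.tanh(theta @ cond)

# Illustrative dimensions (arbitrary choices for the sketch).
obs_dim, cons_dim, act_dim = 4, 3, 2
W = rng.normal(size=(cons_dim, obs_dim))
theta = rng.normal(size=(act_dim, obs_dim + cons_dim + act_dim))

obs = rng.normal(size=obs_dim)
c = consensus_learner(obs, W)   # shared consensus guidance
a = consistency_policy(obs, c, theta)  # stochastic, consensus-guided action
```

Because each call draws a fresh noise seed, repeated calls on the same observation yield different actions, which is the mechanism the abstract credits for escaping local optima.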