Efficient exploration is crucial in cooperative multi-agent reinforcement learning (MARL), especially in sparse-reward settings. However, because existing methods rely on unimodal policies, they are prone to getting stuck in local optima, which hinders the exploration of better policies. Furthermore, in sparse-reward settings each agent receives rewards only rarely, which makes inter-agent cooperation considerably harder. This both increases the difficulty of policy learning and degrades overall performance on multi-agent tasks. To address these issues, we propose a Consistency Policy with Intention Guidance (CPIG), which has two primary components: (a) a multimodal policy that enhances each agent's exploration capability, and (b) intention sharing among agents to foster cooperation. For component (a), CPIG uses a consistency model as the policy, leveraging its multimodal and stochastic nature to facilitate exploration. For component (b), we introduce an Intention Learner that infers an intention about the global state from each agent's local observation. This intention then guides the Consistency Policy, promoting cooperation among agents. We evaluate the proposed method in multi-agent particle environments (MPE) and multi-agent MuJoCo (MAMuJoCo). Empirical results show that our method achieves performance comparable to various baselines in dense-reward environments and significantly improves performance in sparse-reward settings, outperforming state-of-the-art (SOTA) algorithms by 20%.
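To make the two components concrete, the following is a minimal sketch of how an intention-conditioned consistency policy could sample an action. It is not the authors' implementation: the class `ToyConsistencyPolicy`, the linear intention map, the untrained random weights, and the noise schedule are all hypothetical choices made for illustration. The skip/output scaling and the multi-step sampling loop follow the generic consistency-model recipe, which the abstract does not spell out, so treat them as assumptions as well.

```python
import numpy as np

# Assumed constants, borrowed from common consistency-model setups (not from the abstract).
SIGMA_DATA = 0.5   # assumed scale of clean actions
EPS = 0.002        # smallest noise level
T_MAX = 80.0       # largest noise level


def c_skip(t):
    # Weight on the noisy input; equals 1 at t = EPS.
    return SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)


def c_out(t):
    # Weight on the network output; equals 0 at t = EPS.
    return SIGMA_DATA * (t - EPS) / np.sqrt(SIGMA_DATA**2 + t**2)


class ToyConsistencyPolicy:
    """Hypothetical per-agent policy with random, untrained weights."""

    def __init__(self, obs_dim, intent_dim, act_dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = act_dim + obs_dim + intent_dim + 1  # +1 for the noise level
        self.w_intent = rng.normal(0, 0.1, (obs_dim, intent_dim))
        self.w1 = rng.normal(0, 0.1, (in_dim, hidden))
        self.w2 = rng.normal(0, 0.1, (hidden, act_dim))

    def intention(self, obs):
        # Stand-in for an intention learner: local observation -> intention vector.
        return np.tanh(obs @ self.w_intent)

    def consistency_fn(self, noisy_action, t, obs, intent):
        # f(a_t, t | obs, intent) = c_skip(t) * a_t + c_out(t) * F(a_t, t, obs, intent)
        feats = np.concatenate([noisy_action, obs, intent, [np.log(t)]])
        raw = np.tanh(feats @ self.w1) @ self.w2
        return c_skip(t) * noisy_action + c_out(t) * raw

    def act(self, obs, rng, noise_levels=(T_MAX, 10.0, 1.0)):
        # Multi-step sampling: denoise from pure noise, then re-noise and denoise
        # at progressively smaller noise levels.
        intent = self.intention(obs)
        act_dim = self.w2.shape[1]
        action = rng.normal(0, noise_levels[0], act_dim)
        action = self.consistency_fn(action, noise_levels[0], obs, intent)
        for t in noise_levels[1:]:
            z = rng.normal(0, 1, act_dim)
            noisy = action + np.sqrt(t**2 - EPS**2) * z
            action = self.consistency_fn(noisy, t, obs, intent)
        return np.clip(action, -1.0, 1.0)  # assumes actions bounded in [-1, 1]


policy = ToyConsistencyPolicy(obs_dim=4, intent_dim=3, act_dim=2)
rng = np.random.default_rng(1)
obs = np.array([0.1, -0.2, 0.3, 0.0])
print(policy.act(obs, rng))  # a different action on every call: sampling is stochastic
```

Because every call starts from fresh noise, repeated calls with the same observation yield different actions, which is the stochastic behavior the abstract credits for better exploration. In a trained system the intention vector would be learned so that it reflects the global state; here it is only a fixed random projection.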