Multi-Agent Reinforcement Learning (MARL) algorithms struggle to explore efficiently because the joint state-action space grows exponentially with the number of agents. While demonstration-guided learning has proven beneficial in single-agent settings, its direct applicability to MARL is hindered by the practical difficulty of obtaining joint expert demonstrations. In this work, we introduce the concept of personalized expert demonstrations, tailored to each individual agent or, more broadly, to each type of agent within a heterogeneous team. These demonstrations pertain solely to single-agent behaviors and show how each agent can achieve its personal goals, without any cooperative elements; naively imitating them therefore does not produce cooperation, because the individual behaviors may conflict. To this end, we propose personalized expert-guided MARL (PegMARL), an approach that selectively uses personalized expert demonstrations as guidance while allowing agents to learn to cooperate. The algorithm uses two discriminators: the first provides incentives based on how closely the policy's behavior aligns with the demonstrations, and the second regulates those incentives based on whether the behavior leads to the desired objective. We evaluate PegMARL with personalized demonstrations in both discrete and continuous environments. The results show that PegMARL learns near-optimal policies even when provided with suboptimal demonstrations, and that it outperforms state-of-the-art MARL algorithms on coordinated tasks. We also show that PegMARL can leverage joint demonstrations in the StarCraft scenario and converges effectively even with demonstrations from policies that were not co-trained.
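To make the role of the two discriminators concrete, the following is a minimal sketch of how their outputs might be combined into a shaped per-agent reward. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the function names, the sigmoid squashing, the multiplicative combination, and the coefficient `eta` are all hypothetical and are not taken from PegMARL itself.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function, mapping a logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def shaped_reward(env_reward: float,
                  behavior_logit: float,
                  outcome_logit: float,
                  eta: float = 0.5) -> float:
    """Combine the environment reward with a demonstration-based incentive.

    All inputs other than env_reward are hypothetical:
      behavior_logit -- raw score from the first discriminator; higher means
                        the agent's local behavior resembles its personalized
                        demonstration.
      outcome_logit  -- raw score from the second discriminator; higher means
                        the behavior appears to lead to the desired objective.
      eta            -- assumed coefficient scaling the incentive.
    """
    # Incentive in (-1, 1): positive when behavior matches the demonstration,
    # negative when it diverges.
    incentive = 2.0 * sigmoid(behavior_logit) - 1.0
    # Weight in (0, 1): shrinks the incentive when the behavior does not
    # appear to lead to the desired objective (e.g. agents blocking each other).
    weight = sigmoid(outcome_logit)
    return env_reward + eta * weight * incentive


if __name__ == "__main__":
    # Demonstration-like behavior that leads to the objective: large bonus.
    print(shaped_reward(0.0, behavior_logit=5.0, outcome_logit=5.0))
    # Same behavior, but it does not lead to the objective: bonus is damped.
    print(shaped_reward(0.0, behavior_logit=5.0, outcome_logit=-5.0))
```

In this sketch the second score only scales the first, so demonstration-like behavior is rewarded strongly only when it also appears to reach the objective. That is one way to read "selectively utilizes" demonstrations, not the only possible one.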