Multi-Agent Reinforcement Learning (MARL) algorithms face the challenge of efficient exploration due to the exponential growth of the joint state-action space. While demonstration-guided learning has proven beneficial in single-agent settings, its direct application to MARL is hindered by the practical difficulty of obtaining joint expert demonstrations. In this work, we introduce the novel concept of personalized expert demonstrations, tailored to each individual agent or, more broadly, to each type of agent within a heterogeneous team. These demonstrations concern only single-agent behavior and how each agent can achieve its personal goals, without encompassing any cooperative elements; naively imitating them therefore fails to achieve cooperation because of potential conflicts between agents. To this end, we propose personalized expert-guided MARL (PegMARL), an approach that selectively uses personalized expert demonstrations as guidance while allowing agents to learn to cooperate. The algorithm employs two discriminators: the first provides incentives based on how well an individual agent's behavior aligns with its demonstration, and the second regulates those incentives based on whether the behavior leads to the desired outcome. We evaluate PegMARL with personalized demonstrations in both discrete and continuous environments. The results show that PegMARL learns near-optimal policies even when provided with suboptimal demonstrations, and that it outperforms state-of-the-art MARL algorithms on coordinated tasks. We also showcase PegMARL's ability to leverage joint demonstrations in a StarCraft scenario, converging effectively even with demonstrations from non-co-trained policies.
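The two-discriminator idea can be illustrated with a minimal reward-shaping sketch. This is not the paper's exact formulation: the function name, the GAIL-style log-form of the imitation bonus, and the multiplicative gating by the outcome discriminator are all our assumptions for illustration only.

```python
import math

def peg_reward(d_personal: float, d_outcome: float,
               env_reward: float, w: float = 0.5) -> float:
    """Hypothetical shaping of a PegMARL-style incentive.

    d_personal: output of the first discriminator in (0, 1), the estimated
        probability that the agent's (state, action) pair matches its
        personalized demonstration.
    d_outcome: output of the second discriminator in (0, 1), the estimated
        probability that the resulting behavior leads to the desired
        (cooperative) outcome.
    env_reward: the environment's own reward signal.
    w: weight on the demonstration-derived bonus (assumed hyperparameter).
    """
    # GAIL-style log-probability bonus from the personalization discriminator.
    imitation_bonus = -math.log(1.0 - d_personal + 1e-8)
    # Gate the bonus by the outcome discriminator: imitation that conflicts
    # with cooperation (d_outcome near 0) is suppressed rather than rewarded.
    return env_reward + w * d_outcome * imitation_bonus
```

With `d_outcome = 0` the shaped reward reduces to the environment reward alone, which captures the abstract's point that demonstration-following is encouraged only when it leads to the desired outcome.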