In multi-agent reinforcement learning (MARL), many popular methods, such as VDN and QMIX, are susceptible to a critical multi-agent pathology known as relative overgeneralization (RO), which arises in cooperative tasks when the utility of the optimal joint action falls below that of a sub-optimal joint action. RO can cause agents to get stuck in local optima or to fail at cooperative tasks that require significant coordination between agents within a given timestep. Recent value-based MARL algorithms such as QPLEX and WQMIX can overcome RO to some extent. However, our experimental results show that they can still fail to solve cooperative tasks that exhibit strong RO. In this work, we propose a novel approach called curriculum learning for relative overgeneralization (CURO) to better overcome RO. To solve a target task that exhibits strong RO, CURO first fine-tunes the reward function of the target task to generate source tasks tailored to the current ability of the learning agent, and trains the agent on these source tasks. Then, to effectively transfer the knowledge acquired in one task to the next, we use a transfer learning method that combines value function transfer with buffer transfer, which enables more efficient exploration in the target task. We demonstrate that, when applied to QMIX, CURO overcomes severe RO and significantly improves performance, yielding state-of-the-art results in a variety of cooperative multi-agent tasks, including the challenging StarCraft II micromanagement benchmarks.
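To make RO concrete, the following minimal sketch uses a 3x3 cooperative matrix game. The payoff values, the helper names (`make_payoff`, `agent_utilities`, `greedy_action`), and the way the reward is softened are all illustrative assumptions, not taken from the paper. An agent's per-action utility is modeled here as its average reward when its partner acts uniformly at random, which is a simplification of what a factored value function such as VDN or QMIX would learn early in training.

```python
def make_payoff(penalty):
    """3x3 cooperative matrix game (hypothetical values).

    Joint action (0, 0) is optimal with reward 8; choosing action 0 while the
    partner does not is punished with `penalty`.
    """
    return [
        [8.0, penalty, penalty],
        [penalty, 0.0, 0.0],
        [penalty, 0.0, 0.0],
    ]


def agent_utilities(payoff):
    """Average reward of each row action when the partner acts uniformly at random."""
    return [sum(row) / len(row) for row in payoff]


def greedy_action(utilities):
    """Index of the highest-utility action (first index wins ties)."""
    return max(range(len(utilities)), key=lambda a: utilities[a])


# Target task: a strong miscoordination penalty makes the optimal action look bad.
target = make_payoff(-12.0)
# Assumed source task: same game with a softened penalty (one possible reward tuning).
source = make_payoff(-2.0)

target_choice = greedy_action(agent_utilities(target))
source_choice = greedy_action(agent_utilities(source))

print("target utilities:", agent_utilities(target), "-> greedy action", target_choice)
print("source utilities:", agent_utilities(source), "-> greedy action", source_choice)
```

In the target game, the action that belongs to the optimal joint action has the lowest average utility, so each agent greedily settles on a sub-optimal action. In the softened source game, the same action has the highest utility. This is only meant to illustrate why training on a reward-tuned source task first might help; the actual reward tuning and transfer procedure in CURO may differ.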