Recent advances in multi-agent reinforcement learning (MARL) allow agents to coordinate their behaviors in complex environments. However, common MARL algorithms still suffer from poor scalability and sparse rewards. One promising approach to both problems is automatic curriculum learning (ACL), in which a student (the curriculum learner) trains on tasks of increasing difficulty selected by a teacher (the curriculum generator). Despite its success, ACL's applicability is limited by (1) the lack of a general student framework that handles both the varying number of agents across tasks and the sparse reward problem, and (2) the non-stationarity of the teacher's task caused by ever-changing student strategies. To address these limitations, we introduce a novel automatic curriculum learning framework, Skilled Population Curriculum (SPC), which adapts curriculum learning to multi-agent coordination. Specifically, we endow the student with population-invariant communication and a hierarchical skill set, allowing it to learn cooperation and behavior skills from distinct tasks with varying numbers of agents. In addition, we model the teacher as a contextual bandit conditioned on student policies, enabling a team of agents to change its size while still retaining previously acquired skills. We also analyze the inherent non-stationarity of this multi-agent automatic curriculum teaching problem and provide a corresponding regret bound. Empirical results show that our method improves performance, scalability, and sample efficiency in several MARL environments.
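To make the teacher's role concrete, the following is a minimal sketch of a contextual bandit that picks the next training task from a summary of the student's current policy. It is not the SPC teacher: the class name `BanditTeacher`, the epsilon-greedy rule, the per-task linear value estimate, and the choice of context features are all assumptions made for illustration, and the sketch does not address the non-stationarity or regret analysis described above.

```python
import random


class BanditTeacher:
    """Hypothetical epsilon-greedy contextual bandit over a fixed task list.

    Each task (arm) keeps a linear estimate of the student's expected
    return given a context vector that summarizes the student policy.
    """

    def __init__(self, tasks, context_dim, epsilon=0.1, lr=0.1, seed=0):
        self.tasks = list(tasks)
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)
        # One weight vector per task: estimated value = w . context
        self.weights = [[0.0] * context_dim for _ in self.tasks]

    def score(self, arm, context):
        """Estimated value of assigning task `arm` under this context."""
        return sum(w * x for w, x in zip(self.weights[arm], context))

    def choose(self, context):
        """Return the index of the next task to train on."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.tasks))  # explore
        scores = [self.score(a, context) for a in range(len(self.tasks))]
        return max(range(len(self.tasks)), key=scores.__getitem__)  # exploit

    def update(self, arm, context, reward):
        """One gradient step on the squared error of the chosen arm."""
        error = reward - self.score(arm, context)
        for i, x in enumerate(context):
            self.weights[arm][i] += self.lr * error * x


# Usage: tasks are team sizes; the context is an assumed 2-feature
# summary of the student (for example, recent success rates).
teacher = BanditTeacher(tasks=[2, 4, 8], context_dim=2, epsilon=0.0)
context = [1.0, 0.0]
teacher.update(arm=1, context=context, reward=1.0)
next_team_size = teacher.tasks[teacher.choose(context)]
```

In this sketch the context vector is the only channel through which the student's changing policy reaches the teacher, which is the sense in which the bandit is "conditioned on student policies"; a real teacher would need an update rule designed for rewards that drift as the student learns.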