Learning cooperative multi-agent policies from offline multi-task data that generalize to unseen tasks with varying numbers of agents and targets is an appealing problem in many scenarios. Although aggregating behavior patterns shared across tasks into skills is a promising way to improve policy transfer, two primary challenges hinder further progress in skill learning for offline multi-task MARL. First, methods that extract general cooperative behaviors from diverse action sequences as common skills fail to embed cooperative temporal knowledge into those skills. Second, existing works learn only common skills and cannot adaptively select independent knowledge as task-specific skills for fine-grained action execution in each task. To tackle these challenges, we propose Hierarchical and Separate Skill Discovery (HiSSD), a novel approach to generalizable offline multi-task MARL through skill learning. HiSSD employs a hierarchical framework that jointly learns common and task-specific skills. The common skills capture cooperative temporal knowledge and enable in-sample exploitation for offline multi-task MARL. The task-specific skills represent the priors of each task and enable task-guided, fine-grained action execution. To evaluate our method, we conduct experiments on the multi-agent MuJoCo and SMAC benchmarks. After training the policy with HiSSD on offline multi-task data, the empirical results show that HiSSD assigns effective cooperative behaviors and achieves superior performance on unseen tasks.
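The hierarchical decomposition described above can be sketched as follows. This is a minimal illustrative sketch, not HiSSD's actual architecture: the network shapes, the use of random weights in place of trained parameters, the concatenation of skill vectors, and all class and variable names are assumptions for exposition only.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    # Random weights stand in for trained parameters in this sketch.
    return rng.normal(scale=0.1, size=(in_dim, out_dim))

class HierarchicalSkillPolicy:
    """Two-level policy sketch: from the agent's observation, infer a
    common (task-shared) skill and a task-specific skill; a low-level
    controller then maps observation plus both skills to an action."""

    def __init__(self, obs_dim, common_dim, specific_dim, act_dim):
        self.W_common = linear(obs_dim, common_dim)      # common cooperative skill
        self.W_specific = linear(obs_dim, specific_dim)  # task-specific skill
        self.W_act = linear(obs_dim + common_dim + specific_dim, act_dim)

    def act(self, obs):
        z_common = np.tanh(obs @ self.W_common)    # shared knowledge across tasks
        z_task = np.tanh(obs @ self.W_specific)    # prior adapted to the current task
        x = np.concatenate([obs, z_common, z_task])
        return np.tanh(x @ self.W_act)             # fine-grained action

# Each agent applies the same policy to its own observation (parameter sharing),
# so the policy is agnostic to the number of agents in a task.
policy = HierarchicalSkillPolicy(obs_dim=8, common_dim=4, specific_dim=4, act_dim=2)
obs_per_agent = rng.normal(size=(3, 8))  # e.g., 3 agents
actions = np.stack([policy.act(o) for o in obs_per_agent])
print(actions.shape)  # (3, 2)
```

Because each agent maps only its local observation through shared weights, the same policy can be evaluated under varying agent counts, which is one prerequisite for the cross-task generalization the abstract targets.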