Multi-agent Deep Covering Skill Discovery

The use of skills (a.k.a. options) can greatly accelerate exploration in reinforcement learning, especially when only sparse reward signals are available. Option discovery methods have been proposed for individual agents, but in multi-agent reinforcement learning (MARL) settings, the discovery of collaborative options that coordinate the behavior of multiple agents and encourage them to visit under-explored regions of their joint state space has not been considered. To address this, we propose Multi-agent Deep Covering Option Discovery, which constructs multi-agent options by minimizing the expected cover time of the agents' joint state space. We also propose a novel framework for adopting the multi-agent options in the MARL process. In practice, a multi-agent task can usually be divided into sub-tasks, each of which can be completed by a sub-group of the agents. Our framework therefore first leverages an attention mechanism to find the collaborative agent sub-groups that would benefit most from coordinated actions. Then, a hierarchical algorithm, HA-MSAC, learns the multi-agent options with which each sub-group completes its sub-task, and integrates them through a high-level policy into a solution of the whole task. This hierarchical option construction allows our framework to strike a balance between scalability and effective collaboration among the agents. Evaluation on multi-agent collaborative tasks shows that the proposed algorithm can effectively capture agent interactions with the attention mechanism, successfully identify multi-agent options, and significantly outperform prior works that use single-agent options or no options, in terms of both faster exploration and higher task rewards.
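
To make the cover-time idea concrete, the following is a minimal tabular sketch, not the method described above. It assumes the standard spectral view used by single-agent covering options: the expected cover time of a state-transition graph is related to the algebraic connectivity (second-smallest Laplacian eigenvalue), and an option linking the two states with the extreme values of the Fiedler vector increases that connectivity. The joint state graph is assumed to be the Cartesian product of two small per-agent path graphs (one agent moves per step), and every function name here is hypothetical. The deep variant would presumably replace the explicit eigendecomposition with a learned approximation.

```python
import numpy as np

def path_adjacency(n):
    """Adjacency matrix of a path graph with n states (illustrative per-agent graph)."""
    A = np.zeros((n, n))
    for i in range(n - 1):
        A[i, i + 1] = A[i + 1, i] = 1.0
    return A

def joint_adjacency(A1, A2):
    """Cartesian product graph: joint state (i, j) has index i * n2 + j.
    Assumption: exactly one agent moves per transition."""
    n1, n2 = A1.shape[0], A2.shape[0]
    return np.kron(A1, np.eye(n2)) + np.kron(np.eye(n1), A2)

def fiedler(A):
    """Return (algebraic connectivity, Fiedler vector) of the graph Laplacian."""
    L = np.diag(A.sum(axis=1)) - A
    vals, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    return vals[1], vecs[:, 1]

def add_covering_option(A):
    """Connect the joint states with the smallest and largest Fiedler values."""
    _, f = fiedler(A)
    s_init, s_term = int(np.argmin(f)), int(np.argmax(f))
    A_new = A.copy()
    A_new[s_init, s_term] = A_new[s_term, s_init] = 1.0
    return A_new, s_init, s_term

# Two agents with 5 and 3 local states, giving 15 joint states.
n1, n2 = 5, 3
A = joint_adjacency(path_adjacency(n1), path_adjacency(n2))
before, _ = fiedler(A)
A_opt, s_init, s_term = add_covering_option(A)
after, _ = fiedler(A_opt)
print("option links joint states", divmod(s_init, n2), "and", divmod(s_term, n2))
print("algebraic connectivity before/after:", before, after)
```

In this toy graph the option joins joint states at opposite ends of the longer agent's path, which is the kind of under-explored far corner that a random walk reaches slowly. The explicit Laplacian grows with the product of the agents' state counts, which is one reason an attention-selected sub-group decomposition is attractive for scalability.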
