A wide range of real-world applications can be formulated as a Multi-Agent Path Finding (MAPF) problem, where the goal is to find collision-free paths for multiple agents with individual start and goal locations. State-of-the-art MAPF solvers are mainly centralized and depend on global information, which limits their scalability and their flexibility: changes to the environment or new maps require expensive replanning. Multi-agent reinforcement learning (MARL) offers an alternative by learning decentralized policies that can generalize across a variety of maps. Although some prior works attempt to connect both areas, the proposed techniques are heavily engineered and very complex, integrating many mechanisms that limit generality and are expensive to use. We argue that much simpler and more general approaches are needed to bring MARL and MAPF closer together at significantly lower cost. In this paper, we propose Confidence-based Auto-Curriculum for Team Update Stability (CACTUS) as a lightweight MARL approach to MAPF. CACTUS defines a simple reverse curriculum scheme, where the goal of each agent is randomly placed within an allocation radius around the agent's start location. The allocation radius increases gradually as all agents improve, which is assessed by a confidence-based measure. We evaluate CACTUS on various maps of different sizes, obstacle densities, and numbers of agents. Our experiments demonstrate better performance and generalization capabilities than state-of-the-art MARL approaches with fewer than 600,000 trainable parameters, which is less than 5% of the neural network size of current MARL approaches to MAPF.
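The reverse curriculum described above can be illustrated with a minimal Python sketch. It is not the CACTUS implementation: the abstract does not specify the distance metric, the confidence measure, or any hyperparameters, so everything below is an assumption made for illustration. In particular, `sample_goal`, `RadiusCurriculum`, the Chebyshev distance, the sliding window of team-level episode outcomes, and the normal-approximation lower bound on the success rate are hypothetical choices, and the sketch does not enforce distinct goals across agents.

```python
import math
import random
from collections import deque


def sample_goal(start, radius, free_cells, rng=random):
    """Pick a random free cell within `radius` of `start`.

    Assumption: distance is Chebyshev (grid "king moves"); the source
    does not say which metric the allocation radius uses.
    """
    sx, sy = start
    candidates = [
        (x, y)
        for (x, y) in free_cells
        if (x, y) != start and max(abs(x - sx), abs(y - sy)) <= radius
    ]
    # Fall back to the start cell if no other free cell is in range.
    return rng.choice(candidates) if candidates else start


class RadiusCurriculum:
    """Hypothetical confidence-gated schedule for the allocation radius."""

    def __init__(self, radius=1, max_radius=10, threshold=0.75, z=1.96, window=50):
        self.radius = radius
        self.max_radius = max_radius
        self.threshold = threshold  # required success rate (assumed value)
        self.z = z                  # width of the confidence bound (assumed)
        self.results = deque(maxlen=window)

    def record(self, team_success):
        """Store one episode outcome and return the (possibly grown) radius."""
        self.results.append(1.0 if team_success else 0.0)
        n = len(self.results)
        if n < self.results.maxlen:
            return self.radius  # not enough evidence yet
        mean = sum(self.results) / n
        stderr = math.sqrt(max(0.0, mean * (1.0 - mean)) / n)
        # Grow only when the lower confidence bound clears the threshold.
        if mean - self.z * stderr >= self.threshold and self.radius < self.max_radius:
            self.radius += 1
            self.results.clear()  # re-evaluate from scratch at the new radius
        return self.radius
```

In a training loop, one would call `sample_goal` once per agent at the start of each episode with the current `radius`, then pass the episode outcome to `record`. Gating on a lower confidence bound rather than the raw mean is one plausible way to avoid enlarging the radius after a short lucky streak.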