Work Smarter Not Harder: Simple Imitation Learning with CS-PIBT Outperforms Large Scale Imitation Learning for MAPF

Multi-Agent Path Finding (MAPF) is the problem of finding efficient collision-free paths for a group of agents in a shared workspace. The MAPF community has largely focused on developing high-performance heuristic search methods. Recently, several works have applied machine learning (ML) techniques to MAPF, usually involving sophisticated architectures, reinforcement learning techniques, and set-ups, but none have used large amounts of high-quality supervised data. Our initial objective in this work was to show that simple large scale imitation learning of high-quality heuristic search methods can lead to state-of-the-art ML MAPF performance. However, we find that, at least with our model architecture, simple large scale imitation learning (700k examples with hundreds of agents per example) does not produce impressive results. Instead, we find that by using prior work that post-processes MAPF model predictions to resolve 1-step collisions (CS-PIBT), we can train a simple ML MAPF model in minutes that dramatically outperforms existing ML MAPF policies. This has serious implications for all future ML MAPF policies (with local communication), which currently struggle to scale. In particular, this finding implies that future learnt policies should (1) always use smart 1-step collision shields (e.g. CS-PIBT), (2) always include the collision shield with greedy actions as a baseline (e.g. PIBT), and (3) focus on longer horizon / more complex planning, as 1-step collisions can be efficiently resolved.
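
To make the idea of a 1-step collision shield concrete, the following is a minimal sketch of a PIBT-style shield on a 4-connected grid. It is an illustration under stated assumptions, not the authors' implementation: the function name `shield_step`, the `scores` input (one preference value per action, as a learnt policy might output), the fixed `priority` list, and the grid encoding are all hypothetical. The published CS-PIBT differs in details such as how the action ordering is derived from the policy output and how priorities are updated over time.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]
# Action set assumed for this sketch: stay, then the four grid moves.
MOVES: List[Cell] = [(0, 0), (0, 1), (0, -1), (1, 0), (-1, 0)]


def shield_step(
    positions: Dict[int, Cell],      # current cell of each agent
    scores: Dict[int, List[float]],  # hypothetical policy output: one score per MOVES entry
    free: Set[Cell],                 # traversable cells
    priority: List[int],             # agents, highest priority first (assumed given)
) -> Dict[int, Cell]:
    """Return a next cell per agent with no vertex or swap collisions."""
    at = {cell: agent for agent, cell in positions.items()}
    nxt: Dict[int, Cell] = {}    # next cell chosen so far
    taken: Dict[Cell, int] = {}  # cells reserved for the next step

    def plan(agent: int) -> bool:
        here = positions[agent]
        # Try actions from most to least preferred by the policy.
        order = sorted(range(len(MOVES)), key=lambda k: -scores[agent][k])
        for k in order:
            target = (here[0] + MOVES[k][0], here[1] + MOVES[k][1])
            if target not in free or target in taken:
                continue  # obstacle or vertex collision
            other = at.get(target)
            if other is not None and other != agent and nxt.get(other) == here:
                continue  # swap (edge) collision
            nxt[agent] = target
            taken[target] = agent
            # An undecided occupant must be pushed out of the way first.
            if other is not None and other != agent and other not in nxt:
                if not plan(other):
                    continue  # occupant is stuck and keeps the cell; try the next action
            return True
        # No action worked: stay put, overriding any reservation made by the caller.
        nxt[agent] = here
        taken[here] = agent
        return False

    for agent in priority:
        if agent not in nxt:
            plan(agent)
    return nxt


# Two adjacent agents in a corridor that each prefer to move into the other's cell.
corridor = {(0, 0), (1, 0), (2, 0)}
start = {0: (0, 0), 1: (1, 0)}
prefs = {0: [0.1, 0.0, 0.0, 0.9, 0.0], 1: [0.2, 0.0, 0.0, 0.0, 0.8]}
print(shield_step(start, prefs, corridor, priority=[0, 1]))  # {0: (1, 0), 1: (2, 0)}
```

In this example the higher-priority agent gets its preferred move, and the other agent is pushed onward instead of swapping places with it. Feeding the shield scores that simply prefer the action closest to the goal corresponds to the greedy baseline named in point (2).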
