Distributed Online Rollout for Multivehicle Routing in Unmapped Environments

In this work we consider a generalization of the well-known multivehicle routing problem: given a network, a set of agents occupying a subset of its nodes, and a set of tasks, we seek a minimum-cost sequence of movements subject to the constraint that each task is visited by some agent at least once. The classical version of this problem assumes a central computational server that observes the entire state of the system perfectly and directs individual agents according to a centralized control scheme. In contrast, we assume that there is no centralized server and that each agent is an individual processor with no a priori knowledge of the underlying network (including task and agent locations). Moreover, our agents possess strictly local communication and sensing capabilities (restricted to a fixed radius around their respective locations), which aligns more closely with several real-world multiagent applications. These restrictions introduce many challenges, which are overcome through local information sharing and direct coordination between agents. We present a fully distributed, online, and scalable reinforcement learning algorithm for this problem in which agents self-organize into local clusters and independently apply a multiagent rollout scheme to each cluster. We demonstrate empirically via extensive simulations that there exists a critical sensing radius beyond which the distributed rollout algorithm begins to improve over a greedy base policy. This critical sensing radius grows proportionally to the $\log^*$ function of the size of the network and is therefore a small constant for any relevant network. Our decentralized reinforcement learning algorithm achieves approximately a factor-of-two cost improvement over the base policy for sensing radii between two and three times the critical sensing radius.
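To illustrate why a quantity proportional to $\log^*$ of the network size behaves like a small constant, the following minimal Python sketch computes the iterated logarithm. The function name `log_star` and the choice of base 2 are assumptions made for illustration; the abstract specifies neither the base nor the constant of proportionality, so the values below indicate only the growth rate, not the critical sensing radius itself.

```python
import math


def log_star(n: float) -> int:
    """Iterated logarithm (base 2, an assumed convention).

    Counts how many times log2 must be applied to n
    before the result drops to 1 or below.
    """
    count = 0
    while n > 1.0:
        n = math.log2(n)
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical network sizes, chosen only to show the growth rate.
    for size in (2, 16, 65536, 10**9, 10**80):
        print(f"n = {size:.3g}: log*(n) = {log_star(size)}")
```

Under this convention, `log_star(65536)` is 4 and `log_star(10**80)` is 5, so the value barely changes across any network size of practical interest.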
