Distributed Online Rollout for Multivehicle Routing in Unmapped Environments

In this work we consider a generalization of the well-known multivehicle routing problem: given a network, a set of agents occupying a subset of its nodes, and a set of tasks, we seek a minimum cost sequence of movements subject to the constraint that each task is visited by some agent at least once. The classical version of this problem assumes a central computational server that observes the entire state of the system perfectly and directs individual agents according to a centralized control scheme. In contrast, we assume that there is no centralized server and that each agent is an individual processor with no a priori knowledge of the underlying network (including task and agent locations). Moreover, our agents possess strictly local communication and sensing capabilities (restricted to a fixed radius around their respective locations), aligning more closely with several real-world multiagent applications. These restrictions introduce many challenges that are overcome through local information sharing and direct coordination between agents. We present a fully distributed, online, and scalable reinforcement learning algorithm for this problem whereby agents self-organize into local clusters and independently apply a multiagent rollout scheme locally to each cluster. We demonstrate empirically via extensive simulations that there exists a critical sensing radius beyond which the distributed rollout algorithm begins to improve over a greedy base policy. This critical sensing radius grows proportionally to the $\log^*$ function of the size of the network, and is, therefore, a small constant for any relevant network. Our decentralized reinforcement learning algorithm achieves approximately a factor of two cost improvement over the base policy for a range of radii bounded from below and above by two and three times the critical sensing radius, respectively.
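To illustrate why a radius that scales with $\log^*$ is effectively a small constant, the following minimal sketch computes the iterated logarithm. It assumes base 2, which the abstract does not specify, and the helper name `log_star` is hypothetical rather than taken from the paper.

```python
import math


def log_star(n: float) -> int:
    """Iterated logarithm: how many times log2 must be applied
    before the value drops to 1 or below.

    Base 2 is an assumption made for illustration only.
    """
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count


if __name__ == "__main__":
    # Network sizes spanning many orders of magnitude.
    for size in (16, 65536, 10**6, 10**9, 2**65536):
        label = str(size) if size < 10**12 else "2**65536"
        print(f"log*({label}) = {log_star(size)}")
```

Under this base-2 assumption, the value does not exceed 5 even for a network with $2^{65536}$ nodes, which is consistent with the claim that the critical sensing radius stays small for any relevant network.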
