Multiagent Reinforcement Learning for Autonomous Routing and Pickup Problem with Adaptation to Variable Demand

We derive a learning framework to generate routing/pickup policies for a fleet of autonomous vehicles tasked with servicing stochastically appearing requests on a city map. We focus on policies that 1) give rise to coordination among the vehicles, thereby reducing wait times for servicing requests, 2) are non-myopic and account a priori for potential future requests, and 3) can adapt to changes in the underlying demand distribution. Specifically, we are interested in policies that adapt to fluctuations in actual demand conditions in urban environments, such as on-peak vs. off-peak hours. We achieve this through a combination of (i) an online play algorithm that improves the performance of an offline-trained policy, and (ii) an offline approximation scheme that allows for adapting to changes in the underlying demand model. In particular, we achieve adaptivity of our learned policy to different demand distributions by quantifying a region of validity using the q-valid radius of a Wasserstein ambiguity set. We propose a mechanism for switching away from the originally trained offline approximation when the current demand is outside the original validity region. In this case, we propose to use an offline architecture trained on a historical demand model that is closer to the current demand in terms of Wasserstein distance. We learn routing and pickup policies over real taxicab requests in San Francisco with high variability between on-peak and off-peak hours, demonstrating the ability of our method to adapt to real fluctuations in demand distributions. Our numerical results demonstrate that our method outperforms alternative rollout-based reinforcement learning schemes, as well as other classical methods from operations research.
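
The switching rule described above can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: demand is reduced to a discrete distribution over a sorted one-dimensional support (a real city map would need a graph or planar ground metric), the validity radius is passed in as a known number rather than derived as a q-valid radius, and the names `wasserstein_1d`, `select_offline_model`, `off_peak`, and `on_peak` are hypothetical.

```python
from itertools import accumulate


def wasserstein_1d(p, q, support):
    """Wasserstein-1 distance between two discrete distributions that
    share a sorted 1-D support (assumption: demand is binned on a line)."""
    if not (len(p) == len(q) == len(support)):
        raise ValueError("p, q and support must have equal length")
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    # In 1-D, W1 is the area between the two cumulative distributions.
    return sum(
        abs(cp - cq) * (support[i + 1] - support[i])
        for i, (cp, cq) in enumerate(zip(cdf_p[:-1], cdf_q[:-1]))
    )


def select_offline_model(current, active, historical, support, radius):
    """Keep the active offline model while the current demand stays within
    `radius` of its training distribution; otherwise switch to the
    historical model closest to the current demand."""
    if wasserstein_1d(current, historical[active], support) <= radius:
        return active
    return min(
        historical,
        key=lambda name: wasserstein_1d(current, historical[name], support),
    )


# Hypothetical demand histograms over four pickup zones on a line.
support = [0, 1, 2, 3]
historical = {
    "off_peak": [0.7, 0.2, 0.1, 0.0],
    "on_peak": [0.1, 0.2, 0.3, 0.4],
}
current = [0.1, 0.3, 0.3, 0.3]

# The radius value 0.5 is an illustrative assumption.
chosen = select_offline_model(current, "off_peak", historical, support, radius=0.5)
print(chosen)
```

In this toy setting the current demand is far from the off-peak histogram and close to the on-peak one, so the rule switches models. The threshold stands in for the validity region; how that radius is actually computed is not specified in this abstract.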
