BMG-Q: Localized Bipartite Match Graph Attention Q-Learning for Ride-Pooling Order Dispatch

This paper introduces Localized Bipartite Match Graph Attention Q-Learning (BMG-Q), a novel Multi-Agent Reinforcement Learning (MARL) framework tailored for ride-pooling order dispatch. BMG-Q models the ride-pooling decision-making process with a localized bipartite match graph underlying the Markov Decision Process, which enables a novel Graph Attention Double Deep Q Network (GATDDQN) to serve as the MARL backbone and capture the dynamic interactions among ride-pooling vehicles in the fleet. Through GATDDQN, our approach enriches the state information of each agent by leveraging a localized bipartite interdependence graph, and it enables a centralized global coordinator to optimize order matching and agent behavior using Integer Linear Programming (ILP). Gradient clipping and localized graph sampling improve the scalability and robustness of GATDDQN. Furthermore, a posterior score function included in the ILP captures the online exploration-exploitation trade-off and reduces the potential overestimation bias of agents, thereby improving the quality of the derived solutions. In extensive experiments and validation, BMG-Q demonstrates superior performance in both training and operations for thousands of vehicle agents: it outperforms benchmark reinforcement learning frameworks by around 10% in cumulative rewards and reduces overestimation bias by over 50%. It also remains robust under task variations and fleet size changes, establishing BMG-Q as an effective, scalable, and robust framework for advancing ride-pooling order dispatch operations.
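
To make the role of the centralized coordinator concrete, the following is a minimal sketch of a score-maximizing vehicle-to-order matching on a toy instance. It is not the paper's ILP: the abstract does not give the formulation or the posterior score function, so the `scores` matrix, the idle score of zero, the one-order-per-vehicle constraint, and the function name `best_matching` are all illustrative assumptions. Exhaustive search stands in for an ILP solver and is only viable for a handful of vehicles.

```python
from itertools import permutations


def best_matching(scores):
    """Return (total, assignment) for a toy vehicle-to-order matching.

    scores[v][o] is a hypothetical per-agent score (for example, a learned
    Q-value) for vehicle v serving order o. assignment[v] is the order index
    given to vehicle v, or None if the vehicle stays idle. Assumptions: each
    order goes to at most one vehicle, each vehicle takes at most one order,
    and staying idle contributes a score of zero.
    """
    n_vehicles = len(scores)
    n_orders = len(scores[0]) if scores else 0
    # Add one "idle" slot per vehicle so that any vehicle may stay unassigned.
    slots = list(range(n_orders)) + [None] * n_vehicles

    best_total, best_assign = float("-inf"), ()
    # Enumerate every way to hand distinct slots to the vehicles.
    for perm in permutations(slots, n_vehicles):
        total = sum(scores[v][o] for v, o in enumerate(perm) if o is not None)
        if total > best_total:
            best_total, best_assign = total, perm
    return best_total, list(best_assign)


# Toy instance: three vehicles, two open orders (values are made up).
toy_scores = [
    [5.0, 1.0],
    [4.0, 3.0],
    [0.5, 2.0],
]
total, assignment = best_matching(toy_scores)
print(total, assignment)
```

The point of the sketch is that the matching is decided jointly: each vehicle's best individual choice can conflict with another's, so the coordinator maximizes the sum of scores under the matching constraints rather than letting agents pick greedily. At fleet scale, an ILP or assignment solver would replace the enumeration.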
