Order dispatch is a critical task in ride-sharing systems with Autonomous Vehicles (AVs), directly influencing efficiency and profits. Recently, Multi-Agent Reinforcement Learning (MARL) has emerged as a promising solution to this problem by decomposing the large state and action spaces among individual agents, effectively addressing the Curse of Dimensionality (CoD) in the transportation market, which arises from the substantial number of vehicles, passengers, and orders. However, conventional MARL-based approaches rely heavily on accurate estimation of the value function, which becomes problematic in large-scale, highly uncertain environments. To address this issue, we propose two novel methods that bypass value function estimation by leveraging the homogeneity of AV fleets. First, we draw an analogy between AV fleets and the groups in Group Relative Policy Optimization (GRPO), adapting GRPO to the order dispatch task. By replacing the Proximal Policy Optimization (PPO) baseline with the group-average reward-to-go, GRPO eliminates critic estimation errors and reduces training bias. Inspired by this baseline replacement, we further propose One-Step Policy Optimization (OSPO) and show that, for a homogeneous fleet, the optimal policy can be trained using only one-step group rewards. Experiments on a real-world ride-hailing dataset show that both GRPO and OSPO achieve promising performance across all scenarios, efficiently optimizing pickup times and the number of served orders using simple Multilayer Perceptron (MLP) networks. Furthermore, OSPO outperforms GRPO in all scenarios, which we attribute to its elimination of the bias introduced by GRPO's bounded time horizon. Our code, trained models, and processed data are available at https://github.com/RS2002/OSPO.
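The snippet below is a minimal, illustrative sketch (not the authors' released implementation) of the group-relative baseline described above: each homogeneous agent's advantage is its reward-to-go minus the group-average reward-to-go, so no learned critic is required. The function name, tensor layout, and discounting details are assumptions made for illustration only.

import torch


def group_relative_advantages(rewards: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    """Illustrative GRPO-style advantages for a homogeneous fleet.

    Args:
        rewards: tensor of shape (num_agents, horizon) with per-step
            rewards for each vehicle in the group (assumed layout).
        gamma: discount factor.

    Returns:
        Tensor of shape (num_agents, horizon): each agent's discounted
        reward-to-go minus the group-average reward-to-go at that step,
        i.e. a critic-free advantage estimate.
    """
    num_agents, horizon = rewards.shape
    rtg = torch.zeros_like(rewards)
    running = torch.zeros(num_agents, dtype=rewards.dtype)
    # Discounted reward-to-go, accumulated backwards in time.
    for t in reversed(range(horizon)):
        running = rewards[:, t] + gamma * running
        rtg[:, t] = running
    # Replace the learned PPO baseline with the group mean.
    baseline = rtg.mean(dim=0, keepdim=True)
    return rtg - baseline

Under OSPO, the abstract indicates that only one-step group rewards are needed, so the analogous computation would reduce to the one-step reward minus the group-mean one-step reward rather than a discounted reward-to-go.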