Semi-Markov Reinforcement Learning for City-Scale EV Ride-Hailing with Feasibility-Guaranteed Actions

We study city-scale control of electric-vehicle (EV) ride-hailing fleets, where dispatch, repositioning, and charging decisions must respect charger and feeder limits under uncertain, spatially correlated demand and travel times. We formulate the problem as a hex-grid semi-Markov decision process (semi-MDP) with mixed actions (discrete choices to serve, reposition, or charge, plus a continuous charging power) and variable action durations. To guarantee physical feasibility during both training and deployment, the policy learns over high-level intentions produced by a masked, temperature-annealed actor. At every decision step, these intentions are projected onto the feasible set by a time-limited rolling mixed-integer linear program (MILP) that strictly enforces state-of-charge, port, and feeder constraints. To mitigate distributional shift, we optimize a Soft Actor--Critic (SAC) agent against a Wasserstein-1 ambiguity set whose graph-aligned Mahalanobis ground metric captures spatial correlations. The robust backup uses the Kantorovich--Rubinstein dual, a projected subgradient inner loop, and a primal--dual risk-budget update. The architecture combines a two-layer Graph Convolutional Network (GCN) encoder, twin critics, and a value network that drives the adversary. In experiments on a large-scale EV fleet simulator built from NYC taxi data, the proposed method, PD--RSAC, achieves the highest net profit, \$1.22M, compared with \$0.58M--\$0.70M for strong heuristic, single-agent RL, and multi-agent RL baselines (Greedy, SAC, MAPPO, and MADDPG), while maintaining zero feeder-limit violations.
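
For readers unfamiliar with the duality the robust backup relies on, the block below recalls the standard Kantorovich--Rubinstein identity and the lower bound it implies for a worst-case expectation over a Wasserstein-1 ball. It is a sketch of the textbook result, not the paper's exact backup: the symbols $\hat{P}$ (nominal next-state distribution), $\varepsilon$ (ball radius), $M$ (Mahalanobis matrix), $V$ (value function), and $L_V$ (its Lipschitz constant) are notation assumed here for illustration, and the abstract does not specify how $M$ is built from the graph.

```latex
% Kantorovich--Rubinstein duality for W_1 under an assumed Mahalanobis ground metric d_M
W_1(P, Q) \;=\; \sup_{\|f\|_{\mathrm{Lip}(d_M)} \le 1}
  \Big( \mathbb{E}_{P}[f] - \mathbb{E}_{Q}[f] \Big),
\qquad
d_M(x, y) \;=\; \sqrt{(x - y)^{\top} M \,(x - y)}, \quad M \succ 0 .

% Consequence: if V is L_V-Lipschitz with respect to d_M, then for every Q in the ball
% E_Q[V] - E_{\hat P}[V] >= -L_V W_1(Q, \hat P) >= -L_V \varepsilon, hence
\inf_{Q \,:\, W_1(Q, \hat{P}) \le \varepsilon} \mathbb{E}_{Q}[V]
\;\ge\; \mathbb{E}_{\hat{P}}[V] \;-\; \varepsilon \, L_V .
```

Under these assumptions, the worst-case value is the nominal value minus a penalty proportional to both the radius $\varepsilon$ and the value function's sensitivity measured in $d_M$. This is one way to see why the choice of ground metric determines which perturbations the adversary is able to exploit.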
