Previous work has shown that when multiple selfish Autonomous Vehicles (AVs) are introduced to future cities and start learning optimal routing strategies using Multi-Agent Reinforcement Learning (MARL), they may destabilize traffic systems, as they would require a significant amount of time to converge to the optimal solution, equivalent to years of real-world commuting. We demonstrate that moving beyond the selfish component in the reward substantially mitigates this issue. If each AV, apart from minimizing its own travel time, aims to reduce its impact on the system, this is beneficial not only for system-wide performance but also for each individual player in this routing game. By introducing an intrinsic reward signal based on the marginal cost matrix, we significantly reduce training time and achieve convergence more reliably. Marginal cost quantifies the impact of each individual action (route choice) on the system (total travel time). Including it as one of the components of the reward can reduce the degree of non-stationarity by aligning agents' objectives. Notably, the proposed counterfactual formulation preserves the system's equilibria and avoids oscillations. Our experiments show that training MARL algorithms with our novel reward formulation enables the agents to converge to the optimal solution, whereas the baseline algorithms fail to do so. We show these effects in both a toy network and the real-world network of Saint-Arnoult. Our results optimistically indicate that social awareness (i.e., including marginal costs in routing decisions) improves both the system-wide and individual performance of future urban systems with AVs.
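The counterfactual marginal-cost idea described above can be illustrated with a minimal sketch (not the paper's implementation) on a two-route Pigou-style network: one route congests linearly with its load, the other has constant travel time. The congestion function, the assumed `CAPACITY`, and the mixing weight `beta` are illustrative choices, not the paper's exact formulation.

```python
# Hedged sketch of a counterfactual marginal-cost reward on a
# two-route network. All constants below are illustrative assumptions.

CAPACITY = 2.0  # assumed capacity of the congestible route


def route_times(loads):
    """Per-route travel time: route 0 congests linearly, route 1 is flat."""
    return [loads[0] / CAPACITY, 1.0]


def system_time(loads):
    """Total travel time summed over all drivers."""
    return sum(n * t for n, t in zip(loads, route_times(loads)))


def marginal_cost(loads, route):
    """Counterfactual impact of one driver choosing `route`:
    system time with that driver minus system time without them."""
    without = list(loads)
    without[route] -= 1
    return system_time(loads) - system_time(without)


def reward(loads, route, beta=0.5):
    """Mixed reward: the driver's own travel time plus `beta` times
    their marginal cost on the system, both negated as MARL rewards."""
    own = route_times(loads)[route]
    return -own - beta * marginal_cost(loads, route)


loads = [2, 2]  # two drivers on each route
# The congested route carries a larger marginal cost than the flat one,
# so the socially-aware reward steers agents toward the system optimum.
print(marginal_cost(loads, 0), marginal_cost(loads, 1))  # 1.5 1.0
print(reward(loads, 0), reward(loads, 1))  # -1.75 -1.5
```

With `beta = 0`, the reward reduces to the purely selfish travel-time objective; increasing `beta` penalizes the externality a driver imposes on others, which is the alignment mechanism the abstract credits for reduced non-stationarity.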