The very few studies that have attempted to formulate multi-agent reinforcement learning (RL) algorithms for adaptive traffic signal control have mainly used value-based RL methods, although recent literature has shown that policy-based methods may perform better in partially observable environments. Additionally, because of the simplifying assumptions on signal timing made almost universally across previous studies, RL methods remain largely untested for real-world signal timing plans. This study formulates a multi-agent proximal policy optimization (MA-PPO) algorithm to implement adaptive and coordinated traffic control along an arterial corridor. The formulated MA-PPO has a centralized critic architecture under the centralized training and decentralized execution framework. All agents are formulated to allow selection and implementation of up to eight signal phases, as commonly implemented in field controllers. The formulated algorithm is tested on a simulated real-world corridor with seven intersections; complete real-world traffic movements and signal phases; traffic volumes; and network geometry, including intersection spacing. The performance of the formulated MA-PPO adaptive control algorithm is compared with the field-implemented coordinated and actuated signal control (ASC) plans modeled using Vissim-MaxTime software-in-the-loop simulation (SIL). The speed of convergence for each agent largely depended on the size of the action space, which in turn depended on the number and sequence of signal phases. Compared with the currently implemented ASC signal timings, MA-PPO showed travel time reductions of about 14% and 29% for the two through movements across the entire test corridor. In volume sensitivity experiments, the formulated MA-PPO showed good stability, robustness, and adaptability to changes in traffic demand.
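To make the described architecture concrete, the sketch below shows the centralized-training, decentralized-execution structure the abstract names: one decentralized actor per intersection that maps a local observation to a distribution over its signal phases, and a single centralized critic that sees the joint observation of all agents during training only. All dimensions, parameter shapes, and the linear function approximators are illustrative assumptions, not the paper's actual networks; the corridor size (7 agents) and action space (up to 8 phases) follow the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

N_AGENTS = 7   # seven intersections along the test corridor (from the abstract)
OBS_DIM = 12   # hypothetical per-intersection observation size (assumption)
N_PHASES = 8   # up to eight signal phases per agent (from the abstract)

def softmax(z):
    """Numerically stable softmax over action logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

class Actor:
    """Decentralized policy: maps one agent's LOCAL observation to a
    probability distribution over its own signal phases. Linear layer
    used here purely as a stand-in for the paper's policy network."""
    def __init__(self, obs_dim, n_actions):
        self.W = rng.normal(scale=0.1, size=(n_actions, obs_dim))

    def act(self, obs):
        probs = softmax(self.W @ obs)
        action = int(rng.choice(len(probs), p=probs))
        return action, probs

class CentralCritic:
    """Centralized value function: evaluates the JOINT observation of all
    agents. Used only during training; execution needs the actors alone."""
    def __init__(self, joint_dim):
        self.w = rng.normal(scale=0.1, size=joint_dim)

    def value(self, joint_obs):
        return float(self.w @ joint_obs)

# One actor per intersection, one shared critic over the joint state.
actors = [Actor(OBS_DIM, N_PHASES) for _ in range(N_AGENTS)]
critic = CentralCritic(N_AGENTS * OBS_DIM)

# Decentralized execution: each agent selects a phase from its own observation.
local_obs = [rng.normal(size=OBS_DIM) for _ in range(N_AGENTS)]
phases = [actor.act(obs)[0] for actor, obs in zip(actors, local_obs)]

# Centralized training signal: the critic scores the concatenated joint state.
joint_value = critic.value(np.concatenate(local_obs))
```

In a full MA-PPO training loop, the critic's value estimates would feed advantage computation for each actor's clipped PPO surrogate objective; the point here is only the information flow, in which actors never see other agents' observations at execution time.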