Traffic signal control aims to coordinate traffic signals across intersections to improve the traffic efficiency of a district or a city. Deep reinforcement learning (RL), in which each traffic signal is regarded as an agent, has recently been applied to traffic signal control and has shown promising performance. However, several challenges still limit its large-scale application in the real world. To make a policy learned in a training scenario generalize to new, unseen scenarios, we propose a novel Meta Variationally Intrinsic Motivated (MetaVIM) RL method that learns a decentralized policy for each intersection while accounting for neighbor information in a latent way. Specifically, we formulate policy learning as a meta-learning problem over a set of related tasks, where each task corresponds to traffic signal control at one intersection whose neighbors are regarded as the unobserved part of the state. A learned latent variable is then introduced to represent task-specific information and is fed into the policy. In addition, to stabilize policy learning, a novel intrinsic reward is designed to encourage each agent's received rewards and observation transitions to be predictable conditioned only on its own history. Extensive experiments on CityFlow demonstrate that the proposed method substantially outperforms existing approaches and shows superior generalizability.
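The intrinsic-reward idea in the abstract can be illustrated with a small sketch. This is not the formulation used in the paper, which the abstract does not state. It assumes, purely for illustration, that each agent has two diagonal-Gaussian predictors of its next reward and observation: one conditioned only on its own history, and one that also sees neighbor information. The penalty is then the KL divergence between the two predictions. The names `gaussian_kl`, `intrinsic_reward`, `shaped_reward`, `scale`, and `beta` are all hypothetical.

```python
import math
from typing import Sequence, Tuple

Gaussian = Tuple[Sequence[float], Sequence[float]]  # (means, standard deviations)


def gaussian_kl(p: Gaussian, q: Gaussian) -> float:
    """KL(p || q) for diagonal Gaussians, summed over dimensions."""
    mu_p, sigma_p = p
    mu_q, sigma_q = q
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q):
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2 * sq ** 2) - 0.5
    return total


def intrinsic_reward(own_pred: Gaussian, neighbor_pred: Gaussian, scale: float = 1.0) -> float:
    """Assumed form: penalize the gap between the neighbor-aware prediction
    and the prediction made from the agent's own history alone.
    The reward is zero when neighbor information changes nothing."""
    return -scale * gaussian_kl(neighbor_pred, own_pred)


def shaped_reward(extrinsic: float, intrinsic: float, beta: float = 0.1) -> float:
    """Assumed combination: environment reward plus a weighted intrinsic term."""
    return extrinsic + beta * intrinsic


# Example: neighbor information shifts the predicted mean by 1.0.
own = ([0.0], [1.0])
with_neighbors = ([1.0], [1.0])
r_int = intrinsic_reward(own, with_neighbors)
r_total = shaped_reward(extrinsic=-2.0, intrinsic=r_int)
```

Under these assumptions, an agent is rewarded most when its own history already explains its rewards and observation transitions, which is one plausible reading of "predictable conditioned only on its own history".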