Autonomous Intersection Management (AIM) provides a signal-free intersection scheduling paradigm for Connected Autonomous Vehicles (CAVs). Distributed learning methods have emerged as an attractive branch of AIM research. Compared with centralized AIM, distributed AIM can be deployed to CAVs at lower cost, and compared with rule-based and optimization-based methods, learning-based methods can handle complicated real-time intersection scenarios more flexibly. Deep reinforcement learning (DRL) is the mainstream distributed learning approach to AIM problems. However, the large-scale simultaneous interactive decisions of multiple agents, and the rapid environmental changes caused by those interactions, pose challenges for DRL: its reward curve oscillates and is hard to converge, which ultimately compromises safety and computing efficiency. To address this, we propose a non-RL learning framework called Distributed Hierarchical Adversarial Learning (D-HAL). The framework includes an actor network that generates the action of each CAV at each step. An immediate discriminator evaluates the interaction performance of the actor network at the current step, while a final discriminator evaluates the overall trajectory produced by a series of interactions. In this framework, the long-term outcome of a behavior no longer motivates the actor network through discounted rewards, but through a designed adversarial loss function with discriminative labels. The proposed model is evaluated at a four-way, six-lane intersection and outperforms several state-of-the-art methods in ensuring safety and reducing travel time.
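To make the two-level evaluation concrete, the following is a minimal sketch of how an actor loss could combine per-step scores from the immediate discriminator with a single trajectory score from the final discriminator. The abstract does not specify the loss, so everything here is an assumption: binary cross-entropy stands in for the "designed adversarial loss", the label 1 is taken to mean desirable behavior, and the names `actor_adversarial_loss`, `discriminator_loss`, and `final_weight` are hypothetical. Note that no discount factor appears; the long-term outcome enters only through the trajectory-level term.

```python
import math
from typing import Sequence


def bce(prob: float, label: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy between a predicted probability and a 0/1 label."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def actor_adversarial_loss(
    step_scores: Sequence[float],
    trajectory_score: float,
    final_weight: float = 1.0,  # assumed weighting, not taken from the paper
) -> float:
    """Hypothetical actor loss that pushes both discriminators toward label 1.

    step_scores: immediate-discriminator outputs in (0, 1), one per step.
    trajectory_score: final-discriminator output in (0, 1) for the whole trajectory.
    """
    if not step_scores:
        raise ValueError("step_scores must contain at least one step")
    # Per-step term: mean BCE of the immediate scores against the desirable label.
    immediate_term = sum(bce(s, 1.0) for s in step_scores) / len(step_scores)
    # Trajectory term: BCE of the final score against the desirable label.
    final_term = bce(trajectory_score, 1.0)
    return immediate_term + final_weight * final_term


def discriminator_loss(scores: Sequence[float], labels: Sequence[float]) -> float:
    """Mean BCE of discriminator outputs against discriminative labels
    (assumed convention: 1 = desirable, 0 = undesirable)."""
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and of equal length")
    return sum(bce(s, y) for s, y in zip(scores, labels)) / len(scores)
```

Under these assumptions, a trajectory whose steps and overall outcome are both scored close to 1 yields a smaller actor loss than one the discriminators rate poorly, so the loss can be minimized by gradient descent without estimating a discounted return.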