Safety Guaranteed Robust Multi-Agent Reinforcement Learning with Hierarchical Control for Connected and Automated Vehicles

We address the problem of coordinating and controlling Connected and Automated Vehicles (CAVs) in mixed traffic environments under imperfect observations. A commonly used approach is learning-based decision-making, such as reinforcement learning (RL). However, most existing safe RL methods suffer from two limitations: (i) they assume accurate state information, and (ii) safety is generally defined over the expectation of the trajectories. It remains challenging to design optimal coordination among multiple agents while enforcing hard safety constraints at every time step under system state uncertainties (e.g., those arising from noisy sensor measurements, communication, or state estimation methods). We propose a safety-guaranteed hierarchical coordination and control scheme called Safe-RMM to address this challenge. Specifically, the high-level coordination policy of CAVs in the mixed traffic environment is trained with the Robust Multi-Agent Proximal Policy Optimization (RMAPPO) method. Although the policy is trained without uncertainty, our method leverages a worst-case Q network to keep its performance robust when state uncertainties are present during testing. The low-level controller is implemented using model predictive control (MPC) with robust Control Barrier Functions (CBFs), which guarantee safety through their forward invariance property. We compare our method with baselines on different road networks in the CARLA simulator. Results show that our method achieves the best safety and efficiency among the evaluated methods in challenging mixed traffic environments with uncertainties.
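
To make the forward invariance argument concrete, the following is a minimal sketch of a robust discrete-time CBF condition for a single car-following pair. It is not the MPC formulation used in Safe-RMM: it is a closed-form one-step safety filter, and every name and parameter in it (`robust_cbf_accel_bound`, `safe_accel`, `simulate`, the barrier h = gap - d_min - tau * v_ego, the bounds `eps_gap` and `eps_v`, the Euler dynamics) is an illustrative assumption. The sketch further assumes that the ego speed is known exactly, that measurement errors are bounded, and that there are no actuator limits.

```python
import random


def robust_cbf_accel_bound(gap_meas, v_ego, v_lead_meas, *, d_min=5.0, tau=1.0,
                           gamma=0.2, dt=0.1, eps_gap=0.5, eps_v=0.3):
    """Largest acceleration satisfying h(x_next) >= (1 - gamma) * h(x)
    for every true state consistent with the bounded measurement errors.

    Hypothetical barrier: h = gap - d_min - tau * v_ego.
    """
    # Worst case: the true gap and lead speed sit at the low end of their bounds.
    h_worst = (gap_meas - eps_gap) - d_min - tau * v_ego
    closing_worst = (v_lead_meas - eps_v) - v_ego
    # The condition is linear in the acceleration, so the bound is closed-form.
    return (gamma * h_worst + closing_worst * dt) / (tau * dt)


def safe_accel(a_nominal, gap_meas, v_ego, v_lead_meas, **kwargs):
    """Keep the nominal command unless it exceeds the robust CBF bound."""
    return min(a_nominal, robust_cbf_accel_bound(gap_meas, v_ego, v_lead_meas, **kwargs))


def simulate(steps=300, seed=0):
    """Roll out Euler dynamics with bounded measurement noise; return true h per step."""
    rng = random.Random(seed)
    d_min, tau, dt, eps_gap, eps_v = 5.0, 1.0, 0.1, 0.5, 0.3
    gap, v_ego, v_lead = 40.0, 15.0, 10.0  # assumed initial conditions, h(x_0) = 20
    h_history = []
    for _ in range(steps):
        h_history.append(gap - d_min - tau * v_ego)
        # Noisy observations whose errors stay within the assumed bounds.
        gap_meas = gap + rng.uniform(-eps_gap, eps_gap)
        v_lead_meas = v_lead + rng.uniform(-eps_v, eps_v)
        # The nominal policy always asks to accelerate; the filter overrides it when needed.
        a = safe_accel(2.0, gap_meas, v_ego, v_lead_meas, d_min=d_min, tau=tau,
                       dt=dt, eps_gap=eps_gap, eps_v=eps_v)
        gap += (v_lead - v_ego) * dt  # uses the ego speed before the update
        v_ego += a * dt
    return h_history


if __name__ == "__main__":
    print("minimum true barrier value:", min(simulate()))
```

Under these assumptions, if h is nonnegative at the initial state and the filtered command is applied at every step, then h at step k is at least (1 - gamma)^k times its initial value, so the safe set is forward invariant despite the measurement noise. With actuator limits the bound can become infeasible, which is one reason a predictive formulation such as MPC is attractive.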
