Autonomous Vehicles (AVs) have attracted significant attention in recent years, and Reinforcement Learning (RL) has shown remarkable performance in improving vehicle autonomy. In that regard, the widely adopted Model-Free RL (MFRL) promises to solve decision-making tasks in Connected AVs (CAVs), contingent on the availability of a significant amount of training samples. Nevertheless, collecting such data might be infeasible in practice and can lead to learning instability. In contrast, Model-Based RL (MBRL) offers sample-efficient learning, but its asymptotic performance might lag behind state-of-the-art MFRL algorithms. Furthermore, most studies on CAVs are limited to the decision-making of a single AV only, thereby undermining performance due to the absence of communications. In this study, we address the decision-making problem of multiple CAVs with limited communications and propose a decentralized Multi-Agent Probabilistic Ensembles with Trajectory Sampling algorithm, MA-PETS. In particular, to better capture the uncertainty of the unknown environment, MA-PETS leverages Probabilistic Ensemble (PE) neural networks to learn from samples communicated among neighboring CAVs. Afterwards, MA-PETS develops Trajectory Sampling (TS)-based model-predictive control for decision-making. On this basis, we derive a multi-agent group regret bound that depends on the number of agents within the communication range, and mathematically validate that incorporating effective information exchange among agents into the multi-agent learning scheme reduces the worst-case group regret bound. Finally, we empirically demonstrate the superiority of MA-PETS in terms of sample efficiency compared with MFRL.