A safe and efficient decision-making system is crucial for autonomous vehicles. However, the complexity of driving environments limits the effectiveness of many rule-based and machine learning approaches. Reinforcement Learning (RL), with its strong self-learning capability and environmental adaptability, offers a promising solution to these challenges. Nevertheless, safety and efficiency concerns during training hinder its widespread application. To address these concerns, we propose a novel RL framework, Simple to Complex Collaborative Decision (S2CD). First, we rapidly train a teacher model in a lightweight simulation environment. In the more complex and realistic environment, the teacher intervenes whenever the student agent exhibits suboptimal behavior, assessing the value of candidate actions to avert danger. We also introduce an RL algorithm called Adaptive Clipping Proximal Policy Optimization Plus, which combines samples from both teacher and student policies and applies dynamic clipping strategies based on sample importance. This approach improves sample efficiency while effectively alleviating data imbalance. Additionally, we employ the Kullback-Leibler divergence as a policy constraint and transform the constrained problem into an unconstrained one via the Lagrangian method, accelerating the student's learning. Finally, a gradual weaning strategy ensures that the student learns to explore independently over time, overcoming the teacher's limitations and maximizing performance. Simulation experiments in highway lane-change scenarios show that the S2CD framework enhances learning efficiency, reduces training costs, and significantly improves safety compared with state-of-the-art algorithms. The framework also ensures effective knowledge transfer between teacher and student models: even with a suboptimal teacher, the student achieves superior performance, demonstrating the robustness and effectiveness of S2CD.
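To make the clipped-surrogate-with-KL-penalty idea concrete, the following is a minimal, generic sketch of a PPO-style loss in which the clipping range can vary per sample (so teacher-generated and student-generated samples may be clipped differently) and the KL constraint appears as a Lagrangian-style penalty term. All function and variable names here are illustrative assumptions; the exact formulation of Adaptive Clipping Proximal Policy Optimization Plus is defined in the paper itself.

```python
import numpy as np

def clipped_kl_loss(new_logp, old_logp, adv, eps, kl_coef):
    """Sketch of a PPO-style clipped surrogate loss with a KL penalty.

    new_logp, old_logp : per-sample log-probabilities under the new and
        old (behavior) policies.
    adv : per-sample advantage estimates.
    eps : clipping range; may be a per-sample array, so samples drawn
        from the teacher policy can receive a different clipping range
        than samples drawn from the student policy.
    kl_coef : Lagrangian-style multiplier on the KL penalty term.
    """
    ratio = np.exp(new_logp - old_logp)             # importance ratio
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)  # per-sample clip range
    surr = np.minimum(ratio * adv, clipped * adv)   # pessimistic surrogate
    kl = np.mean(old_logp - new_logp)               # sample-based KL estimate
    return -np.mean(surr) + kl_coef * kl            # penalized objective
```

When the new and old policies coincide, the ratio is 1, the KL term vanishes, and the loss reduces to the negative mean advantage, which is a quick sanity check for an implementation like this.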