End-to-end autonomous driving policies based on Imitation Learning (IL) often struggle in closed-loop execution due to the misalignment between open-loop training objectives and real closed-loop driving requirements. While Reinforcement Learning (RL) offers a solution by directly optimizing driving goals via reward signals, rendering-based training environments introduce a rendering gap and are inefficient due to high computational costs. To overcome these challenges, we present PerlAD, a novel pseudo-simulation-based RL method for closed-loop end-to-end autonomous driving. From offline datasets, PerlAD constructs a pseudo-simulation that operates in vector space, enabling efficient, rendering-free trial-and-error training. To bridge the gap between static datasets and dynamic closed-loop environments, PerlAD introduces a prediction world model that generates reactive agent trajectories conditioned on the ego vehicle's plan. Furthermore, to facilitate efficient planning, PerlAD employs a hierarchically decoupled planner that combines IL for lateral path generation with RL for longitudinal speed optimization. Comprehensive experimental results demonstrate that PerlAD achieves state-of-the-art performance on the Bench2Drive benchmark, surpassing the previous end-to-end RL method by 10.29% in Driving Score without requiring expensive online interactions. Additional evaluations on the DOS benchmark further confirm its reliability in handling safety-critical occlusion scenarios.
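As a rough illustration of the data flow described above, the following is a minimal sketch of one rendering-free pseudo-simulation step: an IL head proposes the lateral path, an RL policy chooses the longitudinal speed profile, and a prediction world model rolls surrounding agents forward in vector space so the reward can be computed without any sensor rendering. All function and parameter names here are hypothetical placeholders for exposition, not PerlAD's actual API.

```python
import numpy as np

def pseudo_sim_step(scene_vectors, il_path_head, rl_speed_policy,
                    world_model, reward_fn):
    """One hypothetical rendering-free trial-and-error step in vector space."""
    # IL head proposes the lateral path as a polyline of future waypoints.
    path = il_path_head(scene_vectors)               # shape (T, 2)

    # RL policy selects a longitudinal speed profile along that path.
    speeds = rl_speed_policy(scene_vectors, path)    # shape (T,)

    # Compose the ego plan and let the world model react to it,
    # producing trajectories for the surrounding agents.
    ego_plan = np.concatenate([path, speeds[:, None]], axis=1)  # (T, 3)
    agent_trajs = world_model(scene_vectors, ego_plan)

    # Reward is evaluated directly on vectorized trajectories
    # (e.g., collision, progress, comfort), so no rendering is needed.
    reward = reward_fn(ego_plan, agent_trajs)
    return ego_plan, agent_trajs, reward

if __name__ == "__main__":
    # Dummy stand-ins only, to show the step runs end to end.
    T = 8
    scene = np.zeros((16, 4))  # placeholder vectorized scene
    plan, trajs, r = pseudo_sim_step(
        scene,
        il_path_head=lambda s: np.zeros((T, 2)),
        rl_speed_policy=lambda s, p: np.full(T, 5.0),
        world_model=lambda s, e: np.zeros((3, T, 2)),
        reward_fn=lambda e, a: 0.0,
    )
    print(plan.shape, trajs.shape, r)
```

Decoupling the plan this way lets the RL component explore only over speed profiles while the IL component keeps the path on-distribution, which is one plausible reading of why the abstract pairs IL with lateral control and RL with longitudinal control.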