End-to-end (E2E) autonomous driving models that take only camera images as input and directly predict a future trajectory are appealing for their computational efficiency and potential for improved generalization via unified optimization; however, persistent failure modes remain due to reliance on imitation learning (IL). While online reinforcement learning (RL) could mitigate IL-induced issues, the computational burden of neural rendering-based simulation and large E2E networks renders iterative reward and hyperparameter tuning costly. We introduce a camera-only E2E offline RL framework that performs no additional exploration and trains solely on a fixed simulator dataset. Offline RL offers strong data efficiency and rapid experimental iteration, yet is susceptible to instability from overestimation on out-of-distribution (OOD) actions. To address this, we construct pseudo ground-truth trajectories from expert driving logs and use them as a behavior regularization signal, suppressing imitation of unsafe or suboptimal behavior while stabilizing value learning. Training and closed-loop evaluation are conducted in a neural rendering environment learned from the public nuScenes dataset. Empirically, the proposed method achieves substantial improvements in collision rate and route completion compared with IL baselines. Our code is available at https://github.com/ToyotaInfoTech/PEBC.
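To make the behavior regularization idea concrete, the following is a minimal sketch of a policy loss that trades off a critic's value estimate against deviation from a pseudo ground-truth trajectory. It assumes a TD3+BC-style objective, which the abstract does not confirm as the authors' formulation; the function name `behavior_regularized_loss`, its parameters, and the weighting constant `alpha` are all hypothetical and chosen only for illustration. The released code at the URL above is the authoritative reference.

```python
from typing import Sequence


def behavior_regularized_loss(
    q_values: Sequence[float],
    predicted: Sequence[Sequence[float]],
    pseudo_gt: Sequence[Sequence[float]],
    alpha: float = 2.5,
) -> float:
    """TD3+BC-style policy loss over a batch (hypothetical helper).

    q_values[i]  : critic estimate for the trajectory predicted on sample i
    predicted[i] : flattened predicted trajectory (waypoint coordinates)
    pseudo_gt[i] : flattened pseudo ground-truth trajectory for sample i
    alpha        : assumed trade-off weight between value and regularizer
    """
    n = len(q_values)

    # Scale the value term by the mean |Q| so that alpha does not depend
    # on the reward scale (the normalization used in TD3+BC).
    mean_abs_q = sum(abs(q) for q in q_values) / n
    lam = alpha / (mean_abs_q + 1e-8)
    value_term = -lam * sum(q_values) / n

    # Behavior regularizer: mean squared deviation of each predicted
    # trajectory from its pseudo ground-truth trajectory.
    reg_term = sum(
        sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred)
        for pred, gt in zip(predicted, pseudo_gt)
    ) / n

    # Minimizing this raises the estimated value while keeping predictions
    # close to the pseudo ground truth, which limits OOD actions.
    return value_term + reg_term
```

In an actual training loop the same expression would be written with a differentiable tensor library, with `lam` treated as a constant during backpropagation. The regularizer keeps the policy near trajectories that the fixed dataset supports, which is the usual mechanism by which behavior regularization limits value overestimation on OOD actions.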