High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning. Specifically, a diffusion-based generator is used to produce diverse trajectory candidates, while an RL-optimized discriminator reranks these candidates according to their long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the full high-dimensional trajectory space, thereby improving optimization stability. To further enhance reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. In addition, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds. To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird's-Eye View feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners. Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.
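The generate-then-rerank idea and the group-relative reward normalization named in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_generator` and `toy_score` are hypothetical stand-ins for the diffusion-based generator and the RL-optimized discriminator, a trajectory is reduced to a 1-D list of longitudinal positions, and `group_relative_advantages` shows only the standard group-relative normalization (reward minus group mean, divided by group standard deviation). The temporally consistent variant proposed in the paper is not specified in the abstract and is not reproduced here.

```python
import random
import statistics
from typing import Callable, List, Sequence

# Hypothetical representation: a trajectory is a list of longitudinal positions.
Trajectory = List[float]


def rerank(candidates: Sequence[Trajectory],
           score_fn: Callable[[Trajectory], float]) -> Trajectory:
    """Return the candidate with the highest discriminator score."""
    return max(candidates, key=score_fn)


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def toy_generator(k: int, horizon: int, rng: random.Random) -> List[Trajectory]:
    """Stand-in for the generator: k constant-speed position profiles."""
    candidates = []
    for _ in range(k):
        speed = rng.uniform(0.0, 10.0)
        candidates.append([speed * t for t in range(horizon)])
    return candidates


def toy_score(traj: Trajectory, target_speed: float = 5.0) -> float:
    """Stand-in for the discriminator: prefer speeds close to a target."""
    speed = traj[1] - traj[0]
    return -abs(speed - target_speed)


if __name__ == "__main__":
    rng = random.Random(0)
    candidates = toy_generator(k=8, horizon=5, rng=rng)
    best = rerank(candidates, toy_score)
    advantages = group_relative_advantages([toy_score(c) for c in candidates])
    print("selected trajectory:", best)
    print("group-relative advantages:", advantages)
```

In this sketch the scorer only selects among finished candidates, so the scalar reward never has to be propagated through every dimension of a trajectory, which mirrors the decoupling the abstract describes.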