RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning. Specifically, a diffusion-based generator is used to produce diverse trajectory candidates, while an RL-optimized discriminator reranks these candidates according to their long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the full high-dimensional trajectory space, thereby improving optimization stability. To further enhance reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. In addition, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds. To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird's-Eye View feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners. Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.
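The generate-then-rerank loop described above can be illustrated with a minimal sketch. This is not the RAD-2 implementation: `generate_candidates` is a hypothetical stand-in for the diffusion-based generator (it only samples random-walk trajectories), and `score_candidates` is a hypothetical hand-written heuristic standing in for the RL-optimized discriminator. The obstacle position, safety distance, and penalty weight are illustrative assumptions only.

```python
import numpy as np


def generate_candidates(rng, num_candidates=8, horizon=10):
    """Hypothetical stand-in for the diffusion generator.

    Samples random-walk (x, y) trajectories that mostly move forward in x.
    Returns an array of shape (num_candidates, horizon, 2).
    """
    steps = rng.normal(loc=[1.0, 0.0], scale=[0.2, 0.3],
                       size=(num_candidates, horizon, 2))
    return np.cumsum(steps, axis=1)


def score_candidates(trajectories, obstacle=(5.0, 0.0), safe_dist=1.0):
    """Hypothetical stand-in for the learned discriminator.

    Rewards forward progress and penalizes passing close to one obstacle.
    Returns one scalar score per candidate trajectory.
    """
    progress = trajectories[:, -1, 0]                  # final x position
    dists = np.linalg.norm(trajectories - np.asarray(obstacle), axis=-1)
    min_dist = dists.min(axis=1)                       # closest approach
    penalty = 10.0 * np.maximum(0.0, safe_dist - min_dist)
    return progress - penalty


def rerank(trajectories, scores):
    """Sort candidates from highest to lowest score."""
    order = np.argsort(-scores)
    return trajectories[order], scores[order]


rng = np.random.default_rng(0)
candidates = generate_candidates(rng)
scores = score_candidates(candidates)
ranked, ranked_scores = rerank(candidates, scores)
best = ranked[0]  # the trajectory the planner would execute
```

The point of the split is visible even in this toy version: the scalar score is only used to choose among whole candidates, so the generator itself never receives a sparse scalar signal over the full trajectory space.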

