Diffusion-based planners have emerged as a promising approach for human-like trajectory generation in autonomous driving. Recent works incorporate reinforcement fine-tuning to enhance the robustness of diffusion planners through reward-oriented optimization in a generation-evaluation loop. However, they struggle to generate multi-modal, scenario-adaptive trajectories, which limits how efficiently informative rewards can be exploited during fine-tuning. To resolve this, we propose PlannerRFT, a sample-efficient reinforcement fine-tuning framework for diffusion-based planners. PlannerRFT adopts a dual-branch optimization that simultaneously refines the trajectory distribution and adaptively guides the denoising process toward more promising exploration, without altering the original inference pipeline. To support parallel learning at scale, we develop nuMax, an optimized simulator that achieves rollouts 10 times faster than native nuPlan. Extensive experiments show that PlannerRFT yields state-of-the-art performance, with distinct behaviors emerging during the learning process.