Closed-loop cooperative driving requires planners that generate realistic multimodal multi-agent trajectories while improving safety and traffic efficiency. Existing diffusion planners can model multimodal behaviors from demonstrations, but they often exhibit weak scene consistency and remain poorly aligned with closed-loop objectives; meanwhile, stable online post-training in reactive multi-agent environments remains difficult. We present Multi-ORFT, which couples scene-conditioned diffusion pre-training with stable online reinforcement post-training. In pre-training, the planner uses inter-agent self-attention, cross-attention, and AdaLN-Zero-based scene conditioning to improve scene consistency and road adherence of joint trajectories. In post-training, we formulate a two-level MDP that exposes step-wise reverse-kernel likelihoods for online optimization, and combine dense trajectory-level rewards with variance-gated group-relative policy optimization (VG-GRPO) to stabilize training. On the WOMD closed-loop benchmark, Multi-ORFT reduces collision rate from 2.04% to 1.89% and off-road rate from 1.68% to 1.36%, while increasing average speed from 8.36 to 8.61 m/s relative to the pre-trained planner, and it outperforms strong open-source baselines including SMART-large, SMART-tiny-CLSFT, and VBD on the primary safety and efficiency metrics. These results show that coupling scene-consistent denoising with stable online diffusion-policy optimization improves the reliability of closed-loop cooperative driving.
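The abstract names variance-gated group-relative policy optimization (VG-GRPO) but does not spell out the gating rule. The sketch below illustrates one plausible reading, and it should be treated as an assumption rather than the authors' method: rollouts are grouped by scene, rewards are standardized within each group as in standard GRPO, and groups whose reward variance falls below a threshold contribute zero advantage, so near-identical rollouts do not inject amplified noise into the update. The function name `vg_group_advantages` and the parameters `var_threshold` and `eps` are hypothetical.

```python
import numpy as np


def vg_group_advantages(rewards, var_threshold=1e-4, eps=1e-8):
    """Group-relative advantages with a variance gate (illustrative sketch).

    rewards: array-like of shape (num_groups, group_size), one row per scene,
             one column per sampled joint trajectory (rollout) in that scene.
    Returns (advantages, gate), where gate[g] is True if group g is kept.

    ASSUMPTION: the gating rule (zero out low-variance groups) is a guess at
    what "variance-gated" means; the abstract does not define it.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    var = rewards.var(axis=1, keepdims=True)  # population variance per group

    # Standard GRPO-style normalization: center and scale within each group.
    advantages = (rewards - mean) / np.sqrt(var + eps)

    # Variance gate: groups with (almost) identical rewards carry no
    # preference signal, so their advantages are set to zero.
    gate = var > var_threshold
    advantages = np.where(gate, advantages, 0.0)
    return advantages, gate.squeeze(axis=1)


if __name__ == "__main__":
    # Two scenes, four rollouts each; the second scene has identical rewards.
    demo_rewards = [[1.0, 2.0, 3.0, 4.0],
                    [0.5, 0.5, 0.5, 0.5]]
    adv, kept = vg_group_advantages(demo_rewards)
    print(adv)
    print(kept)
```

In a full post-training loop, each advantage would weight the step-wise reverse-kernel log-likelihoods of the corresponding rollout; that part is omitted here because the abstract gives no detail on the surrogate objective.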