Cooperative driving is a safety- and efficiency-critical task that requires coordinating diverse, interaction-realistic multi-agent trajectories. Existing diffusion-based methods can capture multimodal behaviors from demonstrations, but they often exhibit weak scene consistency and poor alignment with closed-loop cooperative objectives. Post-training is therefore needed for further improvement, yet stable online post-training in reactive multi-agent environments remains challenging. In this paper, we propose SCORP, a scene-consistent multi-agent diffusion planner with stable online reinforcement learning (RL) post-training for cooperative driving. For pre-training, we develop a scene-conditioned multi-agent denoising architecture that couples inter-agent self-attention with a dual-path conditioning mechanism: cross-attention injects scene information directly, while AdaLN-Zero provides additional flexible and stable conditional modulation, improving the scene consistency and road adherence of joint trajectories. For post-training, we formulate a two-layer Markov decision process (MDP) that explicitly integrates the reverse denoising chain with policy-environment interaction. We further co-design dense, well-shaped planning rewards and variance-gated group-relative policy optimization (VG-GRPO) to mitigate advantage collapse and gradient instability during closed-loop training. Extensive experiments show that SCORP outperforms strong open-source baselines on the Waymo Open Motion Dataset (WOMD), improving core safety metrics by 10.47% to 28.26% and efficiency metrics by 1.70% to 7.22%. Moreover, compared with alternative post-training methods, SCORP delivers significant and consistent gains in both driving safety and traffic efficiency, indicating stable, sustained improvement in closed-loop cooperative driving.
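To make the advantage-collapse issue concrete, the following is a minimal sketch of group-relative advantage normalization with a variance gate. It is one plausible reading of "variance-gated" and not the definition used by VG-GRPO, which the abstract does not specify. The function name, the `var_threshold` parameter, and the choice to zero out low-variance groups are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, var_threshold=1e-4, eps=1e-8):
    """Return one advantage per rollout for a single group of rollouts.

    rewards: scalar returns of rollouts sampled for the same scene.
    var_threshold: hypothetical gate (an assumption, not from the paper);
        groups whose reward variance falls below it contribute no gradient.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group

    # Gate: when all rollouts score almost the same, dividing by a tiny
    # sigma would amplify noise, so the whole group is zeroed out instead.
    if sigma ** 2 < var_threshold:
        return [0.0] * len(rewards)

    # Standard group-relative normalization: center on the group mean and
    # scale by the group standard deviation.
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with spread-out rewards yields signed, normalized advantages.
informative = group_relative_advantages([0.0, 2.0])

# A near-constant group is gated to zeros rather than rescaled.
collapsed = group_relative_advantages([1.0, 1.0001, 0.9999])
```

Under these assumptions, a group whose rollouts receive nearly identical rewards carries no learning signal, and gating it avoids dividing by a near-zero standard deviation, which is one common source of unstable gradients in group-normalized policy optimization.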