D$^3$-MoE: Dual Disentangled Diffusion Mixture-of-Experts for Style-Controllable End-to-End Autonomous Driving

Traditional end-to-end autonomous driving frameworks frequently suffer from the "style-averaging" dilemma when trained on high-variance human demonstrations, yielding policies that are homogenized, style-uncontrollable, and even kinematically unsafe. To overcome this limitation, we present D$^3$-MoE (Dual Disentangled Diffusion Mixture-of-Experts), which disentangles trajectory modeling along two complementary axes. On the behavioral axis, generation is decoupled from selection: a style-conditioned diffusion process synthesizes multi-style candidate trajectories in parallel for a single scene, allowing a downstream module to select the optimal trajectory based on user preference or an evaluation score. On the physical axis, decoupled longitudinal and lateral routers, trained without manual labels on self-supervised targets derived from orthogonal ground-truth kinematics, activate their respective experts at inference time. The activated experts are architected as Diffusion Transformers (DiT) and equipped with style-conditioned AdaLN and asymmetric lateral-fusion cross-attention. Each expert independently predicts its corresponding physical state, and the predictions are then reassembled into a unified, kinematically coherent trajectory. Extensive evaluations on the challenging NAVSIM benchmark demonstrate that D$^3$-MoE achieves state-of-the-art planning performance, reaching 88.2 PDMS and 84.3 EPDMS in its default configuration. Moreover, our Best-of-Three ensemble strategy broadens the multi-modal solution space, raising performance to 91.3 PDMS and 87.5 EPDMS. Quantitative and qualitative analyses together confirm the framework's advantages in planning quality and style controllability.
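The two decoupling ideas can be illustrated with a minimal sketch. It is not the D$^3$-MoE implementation: the abstract does not specify how physical states are parameterized or scored, so the sketch assumes that the longitudinal output is a per-step speed, that the lateral output is a per-step yaw rate, and that the two are reassembled with a simple unicycle integration. The names `reassemble`, `select_best`, the style labels, the step size `dt`, and the progress-based score are all hypothetical stand-ins.

```python
import math
from typing import Callable, Dict, List, Tuple

Trajectory = List[Tuple[float, float]]  # sequence of (x, y) waypoints


def reassemble(speeds: List[float], yaw_rates: List[float], dt: float = 0.5) -> Trajectory:
    """Combine separately predicted longitudinal and lateral states.

    Assumption (not from the paper): speeds are in m/s, yaw rates in rad/s,
    and a unicycle model integrates them into waypoints starting at the origin.
    """
    if len(speeds) != len(yaw_rates):
        raise ValueError("longitudinal and lateral sequences must have equal length")
    x = y = heading = 0.0
    waypoints: Trajectory = []
    for v, w in zip(speeds, yaw_rates):
        heading += w * dt                    # lateral state updates the heading
        x += v * math.cos(heading) * dt      # longitudinal state advances the pose
        y += v * math.sin(heading) * dt
        waypoints.append((x, y))
    return waypoints


def select_best(
    candidates: Dict[str, Trajectory],
    score: Callable[[Trajectory], float],
) -> Tuple[str, Trajectory]:
    """Selection step, kept separate from generation: pick the top-scoring style."""
    best_style = max(candidates, key=lambda style: score(candidates[style]))
    return best_style, candidates[best_style]


# Hypothetical per-style outputs standing in for the diffusion experts.
style_outputs = {
    "conservative": ([2.0] * 8, [0.0] * 8),
    "normal": ([4.0] * 8, [0.0] * 8),
    "aggressive": ([6.0] * 8, [0.05] * 8),
}
candidates = {
    style: reassemble(speeds, yaw_rates)
    for style, (speeds, yaw_rates) in style_outputs.items()
}


def progress_minus_drift(traj: Trajectory) -> float:
    """Hypothetical score: forward progress, penalized by lateral drift."""
    final_x, final_y = traj[-1]
    return final_x - 5.0 * abs(final_y)


if __name__ == "__main__":
    style, trajectory = select_best(candidates, progress_minus_drift)
    print(style, trajectory[-1])
```

Because generation and selection only meet at the `candidates` dictionary, the scoring function can be replaced by a user-specified style choice or a learned evaluator without touching the generator, which is the property the behavioral-axis decoupling is meant to provide.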
