Vision-Language-Action (VLA) models for autonomous driving increasingly adopt generative planners trained with imitation learning followed by reinforcement learning. Diffusion-based planners suffer from modality-alignment difficulties, low training efficiency, and limited generalization, while token-based planners are plagued by cumulative causal errors and irreversible decoding. In short, the two dominant paradigms exhibit complementary strengths and weaknesses. In this paper, we propose DriveFine, a masked diffusion VLA model that combines flexible decoding with self-correction capabilities. In particular, we design a novel plug-and-play block-MoE, which seamlessly injects a refinement expert on top of the generation expert. By enabling explicit expert selection during inference and gradient blocking during training, the two experts are fully decoupled, preserving the foundational capabilities and generic patterns of the pretrained weights and highlighting the flexibility and extensibility of the block-MoE design. Furthermore, we design a hybrid reinforcement learning strategy that encourages effective exploration by the refinement expert while maintaining training stability. Extensive experiments on the NAVSIM v1, NAVSIM v2, and Navhard benchmarks demonstrate that DriveFine exhibits strong efficacy and robustness. The code will be released at https://github.com/MSunDYY/DriveFine.
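The expert-decoupling idea described above can be sketched in a few lines. The following is a minimal toy illustration, not the authors' implementation: the class name `BlockMoE`, the zero-initialized residual refinement expert, and the manual "detach" step are all assumptions made for exposition. It shows the two mechanisms the abstract names, explicit expert selection at inference and gradient blocking so refinement updates cannot reach the pretrained generation expert.

```python
import numpy as np

rng = np.random.default_rng(0)

class BlockMoE:
    """Toy block-MoE sketch (hypothetical, for illustration only):
    a pretrained 'generation' expert plus a plug-and-play
    'refinement' expert selected explicitly at inference time."""

    def __init__(self, dim):
        # stands in for frozen pretrained generation-expert weights
        self.w_gen = rng.standard_normal((dim, dim)) * 0.1
        # refinement expert is zero-initialized so it starts as a no-op
        self.w_ref = np.zeros((dim, dim))

    def forward(self, x, use_refinement):
        h = x @ self.w_gen  # generation expert always runs
        if use_refinement:
            # gradient blocking: in a real autodiff framework this copy
            # would be a detach/stop-gradient, so losses on the refinement
            # path never update the pretrained generation weights
            h_blocked = h.copy()
            h = h_blocked + h_blocked @ self.w_ref  # residual refinement
        return h

moe = BlockMoE(dim=4)
x = rng.standard_normal((2, 4))
out_gen = moe.forward(x, use_refinement=False)
out_ref = moe.forward(x, use_refinement=True)
# with a zero-initialized refinement expert, both paths agree initially,
# so plugging the expert in cannot degrade the pretrained behavior
assert np.allclose(out_gen, out_ref)
```

Zero-initializing the refinement expert is a common trick for plug-and-play modules: the added branch contributes nothing at first, so pretrained behavior is preserved exactly until training moves the new weights.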