Vision-Language-Action (VLA) models have recently emerged in autonomous driving, promising to leverage rich world knowledge to improve the cognitive capabilities of driving systems. However, adapting such models to driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning. As a result, existing VLA systems are forced into suboptimal compromises: directly adopting 2D Vision-Language Models (VLMs) yields limited spatial perception, whereas enhancing them with 3D spatial representations often impairs the native reasoning capacity of the VLMs. We argue that this dilemma largely stems from the coupled optimization of spatial perception and semantic reasoning within shared model parameters. To overcome this, we propose UniDriveVLA, a Unified Driving Vision-Language-Action model based on Mixture-of-Transformers that addresses the perception-reasoning conflict via expert decoupling. Specifically, it comprises three experts, for driving understanding, scene perception, and action planning, which are coordinated through masked joint attention. In addition, we combine a sparse perception paradigm with a three-stage progressive training strategy to improve spatial perception while maintaining semantic reasoning capability. Extensive experiments show that UniDriveVLA achieves state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive. Moreover, it performs strongly across a wide range of perception, prediction, and understanding tasks, including 3D detection, online mapping, motion forecasting, and driving-oriented VQA, highlighting its applicability as a unified model for autonomous driving. Code and model have been released at https://github.com/xiaomi-research/unidrivevla.
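To make the idea of coordinating three experts through masked joint attention more concrete, the following minimal sketch builds a group-level attention mask over a mixed token sequence and applies it in a softmax. The abstract does not specify which expert may attend to which, so the visibility rule in `VISIBLE` is purely an illustrative assumption, and the names `VISIBLE`, `build_mask`, and `masked_softmax` are hypothetical rather than taken from the released code.

```python
import math

# Hypothetical visibility rule: which experts' tokens each expert may attend to.
# This pattern is an assumption for illustration, not the paper's actual mask.
VISIBLE = {
    "understanding": {"understanding"},
    "perception": {"understanding", "perception"},
    "action": {"understanding", "perception", "action"},
}

def build_mask(groups):
    """Return mask[i][j] = True if query token i may attend to key token j."""
    return [[k in VISIBLE[q] for k in groups] for q in groups]

def masked_softmax(scores, mask_row):
    """Softmax over the allowed positions; masked positions get weight 0."""
    allowed = [s for s, m in zip(scores, mask_row) if m]
    peak = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

# Each token is tagged with the expert that owns it.
groups = ["understanding", "understanding", "perception", "action"]
mask = build_mask(groups)

# Attention weights for the perception token (index 2) with uniform scores.
weights = masked_softmax([0.0, 0.0, 0.0, 0.0], mask[2])
```

Under this assumed rule, the perception token spreads its attention over the understanding and perception tokens and gives zero weight to the action token. In a full model, each expert would additionally apply its own parameters to its own tokens, which this sketch leaves out.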