AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving

Integrating vision-language models (VLMs) into end-to-end (E2E) autonomous driving (AD) systems has shown promise in improving scene understanding. However, existing integration strategies suffer from at least one of three limitations: they struggle to resolve the distribution misalignment between the reasoning and action spaces, underexploit the general reasoning capabilities of pre-trained VLMs, or incur substantial inference latency during action policy generation, which degrades driving performance. To address these challenges, we propose AutoMoT, an E2E AD framework that unifies reasoning and action generation within a single vision-language-action (VLA) model. Our approach leverages a mixture-of-transformers (MoT) architecture with joint attention sharing, which preserves the general reasoning capabilities of pre-trained VLMs while enabling efficient fast-slow inference through asynchronous execution at different task frequencies. Extensive experiments on multiple benchmarks, in both open- and closed-loop settings, demonstrate that AutoMoT performs competitively with state-of-the-art methods. We further investigate the functional boundary of pre-trained VLMs in AD, examining when AD-tailored fine-tuning is necessary. Our results show that pre-trained VLMs can achieve competitive multi-task scene understanding performance through semantic prompting alone, while fine-tuning remains essential for action-level tasks such as decision-making and trajectory planning. Demonstration videos and qualitative results are available on the \href{https://automot-website.github.io/}{project page}.
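To make the fast-slow idea concrete, the following is a minimal sketch of running two branches at different task frequencies. It is an illustration under stated assumptions, not the implementation described in this work: the class `FastSlowRunner`, the parameter `slow_period`, and the placeholder methods `slow_reason` and `fast_act` are all hypothetical names, the cached string stands in for whatever state the reasoning branch would share through joint attention, and the schedule is simulated in a single loop rather than executed concurrently.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FastSlowRunner:
    """Toy schedule: the slow branch refreshes a cached context every
    `slow_period` ticks, while the fast branch runs on every tick."""

    slow_period: int = 5                   # hypothetical action-to-reasoning frequency ratio
    cached_context: Optional[str] = None   # stand-in for shared reasoning state
    context_tick: int = -1                 # tick at which the cache was last refreshed

    def slow_reason(self, tick: int, observation: str) -> str:
        # Placeholder for the (expensive) reasoning branch.
        return f"context@{tick}:{observation}"

    def fast_act(self, tick: int, observation: str) -> Dict[str, object]:
        # Placeholder for the (cheap) action branch: it reuses the cached
        # context instead of waiting for fresh reasoning.
        return {
            "tick": tick,
            "observation": observation,
            "context": self.cached_context,
            "staleness": tick - self.context_tick,
        }

    def step(self, tick: int, observation: str) -> Dict[str, object]:
        # Refresh the cache only at the slow branch's frequency.
        if tick % self.slow_period == 0:
            self.cached_context = self.slow_reason(tick, observation)
            self.context_tick = tick
        return self.fast_act(tick, observation)


def run_demo(num_ticks: int = 12, slow_period: int = 5) -> List[Dict[str, object]]:
    """Run the toy schedule and return one action record per tick."""
    runner = FastSlowRunner(slow_period=slow_period)
    return [runner.step(t, f"obs{t}") for t in range(num_ticks)]


if __name__ == "__main__":
    for action in run_demo():
        print(action)
```

In this sketch the action branch never blocks on reasoning; it only sees a context that is at most `slow_period - 1` ticks old. A real system would presumably run the slow branch concurrently and exchange model state rather than strings.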
