End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. The recent success of Mixture-of-Experts (MoE) architectures in Large Language Models (LLMs) demonstrates that parameter specialization enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon Drive-$\pi_0$, our baseline adapted from the $\pi_0$ Vision-Language-Action (VLA) model, which originates in the embodied AI field. Specifically, we add the Vision MoE to Drive-$\pi_0$ by training a router to dynamically select the relevant cameras according to the driving context. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add the Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE handles diverse scenarios without suffering from the mode averaging that affects existing models. In closed-loop evaluation on Bench2Drive, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release the code and models of DriveMoE and Drive-$\pi_0$.
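To make the two routing steps concrete, the following is a minimal sketch of MoE-style selection, not the DriveMoE implementation. It assumes a softmax router with top-k selection, which the abstract does not specify; the camera names, skill names, logits, and expert outputs are all hypothetical placeholders.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of router logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(scores, names, k):
    """Keep the k highest-probability names and renormalize their weights.

    Assumption: top-k softmax gating, a common MoE choice; the abstract
    only says that a router "selects" cameras or "activates" experts.
    """
    probs = softmax(scores)
    ranked = sorted(range(len(names)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in ranked)
    return {names[i]: probs[i] / kept for i in ranked}


def mix_expert_outputs(weights, expert_outputs):
    """Weighted sum of the output vectors of the selected experts."""
    dim = len(next(iter(expert_outputs.values())))
    mixed = [0.0] * dim
    for name, w in weights.items():
        for j, value in enumerate(expert_outputs[name]):
            mixed[j] += w * value
    return mixed


# Vision routing: choose which camera views to process (hypothetical names).
CAMERAS = ["front", "front_left", "front_right", "back", "back_left", "back_right"]
camera_logits = [2.0, 1.5, -0.5, -1.0, 0.3, -0.8]  # placeholder router output
selected_cameras = route_top_k(camera_logits, CAMERAS, k=2)

# Action routing: choose which skill expert produces the action (hypothetical names).
SKILLS = ["lane_follow", "turn", "overtake"]
skill_logits = [0.1, 3.0, -1.0]  # placeholder router output
skill_weights = route_top_k(skill_logits, SKILLS, k=1)

# Placeholder per-expert action proposals, e.g. (steer, speed).
expert_outputs = {
    "lane_follow": [0.0, 1.0],
    "turn": [0.5, 0.8],
    "overtake": [-0.5, 1.2],
}
action = mix_expert_outputs(skill_weights, expert_outputs)
```

With `k=1` the action comes from a single skill expert rather than a blend across all of them, which illustrates how explicit specialization can avoid averaging over behavior modes.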