We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving. Built on a multi-modal large language model foundation, EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements. EMMA maximizes the utility of world knowledge from the pre-trained large language models by representing all non-sensor inputs (e.g., navigation instructions and ego vehicle status) and outputs (e.g., trajectories and 3D locations) as natural language text. This approach allows EMMA to jointly process various driving tasks in a unified language space and to generate the outputs for each task using task-specific prompts. Empirically, we demonstrate EMMA's effectiveness by achieving state-of-the-art performance in motion planning on nuScenes as well as competitive results on the Waymo Open Motion Dataset (WOMD). EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA with planner trajectories, object detection, and road graph tasks yields improvements across all three domains, highlighting EMMA's potential as a generalist model for autonomous driving applications. However, EMMA also exhibits certain limitations: it can process only a small number of image frames, does not incorporate accurate 3D sensing modalities like LiDAR or radar, and is computationally expensive. We hope that our results will inspire further research to mitigate these issues and to further evolve the state of the art in autonomous driving model architectures.
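To make the unified-language-space idea concrete, the sketch below shows one plausible way to serialize non-sensor inputs (a navigation instruction and past ego waypoints) into a task prompt and to parse a waypoint sequence back out of a text response. The prompt wording, the `(x, y)` waypoint format, and the function names are illustrative assumptions, not EMMA's actual interface.

```python
# Hypothetical sketch (not EMMA's actual prompt or output format):
# representing non-sensor inputs and trajectory outputs as plain text,
# as the abstract describes for unifying driving tasks in language space.

def format_prompt(instruction, ego_history):
    # ego_history: list of (x, y) waypoints in the ego frame, newest last.
    hist = "; ".join(f"({x:.2f}, {y:.2f})" for x, y in ego_history)
    return (f"Instruction: {instruction}\n"
            f"Ego past waypoints: {hist}\n"
            f"Task: predict future waypoints as (x, y) pairs.")

def parse_trajectory(text):
    # Parse a text response like "(2.40, 0.15); (3.61, 0.20)"
    # back into numeric waypoints.
    pairs = [p.strip(" ()") for p in text.split(";")]
    return [tuple(float(v) for v in p.split(",")) for p in pairs]

prompt = format_prompt("continue straight", [(0.0, 0.0), (1.2, 0.1)])
waypoints = parse_trajectory("(2.40, 0.15); (3.61, 0.20)")
```

Because every task reads and writes this shared text representation, adding a new output type (e.g., 3D boxes) only requires a new prompt and parser, not a new model head.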