Building a multi-modality, multi-task neural network for accurate and robust performance is the de facto standard in perception tasks for autonomous driving. However, leveraging data from multiple sensors to jointly optimize the prediction and planning tasks remains largely unexplored. In this paper, we present FusionAD, to the best of our knowledge the first unified framework that fuses information from the two most critical sensors, camera and LiDAR, and goes beyond the perception task. Concretely, we first build a transformer-based multi-modality fusion network to effectively produce fusion-based features. In contrast to the camera-based end-to-end method UniAD, we then establish fusion-aided modality-aware prediction and status-aware planning modules, dubbed FMSPnP, that take advantage of multi-modality features. We conduct extensive experiments on the commonly used nuScenes benchmark dataset. FusionAD achieves state-of-the-art performance, surpassing baselines by 15% on average on perception tasks such as detection and tracking and by 10% on occupancy prediction accuracy, while reducing the prediction error from 0.708 to 0.389 in ADE score and the collision rate from 0.31% to only 0.12%.
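For readers unfamiliar with the ADE score quoted above, the following is a minimal sketch of the commonly used definition of average displacement error: the mean Euclidean distance between predicted and ground-truth waypoints. It is not the FusionAD evaluation code, and the exact protocol used in the paper (prediction horizon, number of modes, agent filtering) is not stated in the abstract. The function names and the toy trajectory are hypothetical; only the four reported figures (0.708, 0.389, 0.31, 0.12) come from the text.

```python
import math


def average_displacement_error(pred, gt):
    """Mean Euclidean distance between matched predicted and ground-truth waypoints.

    Assumption: the generic ADE definition, not the paper's exact protocol.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


# Toy trajectory with made-up waypoints, for illustration only.
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
gt = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade = average_displacement_error(pred, gt)  # per-step errors are 0, 1, 2

# Relative reductions implied by the figures reported in the abstract.
ade_drop = relative_reduction(0.708, 0.389)
collision_drop = relative_reduction(0.31, 0.12)

print(f"toy ADE: {ade:.3f}")
print(f"ADE reduction: {ade_drop:.1%}")
print(f"collision-rate reduction: {collision_drop:.1%}")
```

Under this reading, the reported numbers correspond to a reduction of roughly 45% in ADE and roughly 61% in collision rate relative to the baseline.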