Building multi-modality, multi-task neural networks for accurate and robust performance is the de facto standard in perception tasks for autonomous driving. However, leveraging data from multiple sensors to jointly optimize prediction and planning remains largely unexplored. In this paper, we present FusionAD, to the best of our knowledge the first unified framework that fuses information from the two most critical sensors, camera and LiDAR, and goes beyond the perception task. Concretely, we first build a transformer-based multi-modality fusion network to effectively produce fusion-based features. In contrast to the camera-based end-to-end method UniAD, we then establish fusion-aided modality-aware prediction and status-aware planning modules, dubbed FMSPnP, that take advantage of the multi-modality features. We conduct extensive experiments on the commonly used nuScenes benchmark dataset. FusionAD achieves state-of-the-art performance, surpassing baselines by 15% on average on perception tasks such as detection and tracking and by 10% on occupancy prediction accuracy, reducing the prediction error from 0.708 to 0.389 in ADE, and reducing the collision rate from 0.31% to only 0.12%.
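To put the reported numbers in context, the following minimal sketch shows how an average displacement error (ADE) is commonly computed and what the quoted reductions amount to in relative terms. It assumes the usual definition of ADE as the mean Euclidean distance between predicted and ground-truth waypoints; the abstract does not define the metric, and the evaluation behind the reported scores may differ (for example, by taking the minimum over several predicted modes). The function names and the toy trajectory are hypothetical and serve illustration only.

```python
import math


def average_displacement_error(pred, gt):
    """Mean Euclidean distance between matching waypoints (assumed ADE definition)."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Toy trajectory, hypothetical values: per-step errors are 0, 1, and 2.
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
gt = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade = average_displacement_error(pred, gt)  # (0 + 1 + 2) / 3 = 1.0

# Relative reductions implied by the figures quoted in the abstract.
ade_drop = relative_reduction(0.708, 0.389)
collision_drop = relative_reduction(0.31, 0.12)

print(f"toy ADE: {ade:.3f}")
print(f"ADE reduction: {ade_drop:.1%}")                   # 45.1%
print(f"collision-rate reduction: {collision_drop:.1%}")  # 61.3%
```

Under these assumptions, the ADE figures correspond to a relative reduction of roughly 45% and the collision-rate figures to roughly 61%.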