Multi-sensor, multi-modal fusion has demonstrated strong advantages in 3D object detection. However, existing methods that fuse multi-modal features must first transform the features into the bird's eye view (BEV) space, which can discard information along the Z-axis and thus degrade performance. To this end, we propose a novel end-to-end, transformer-based multi-modal fusion framework, dubbed FusionFormer, that incorporates deformable attention and residual structures within the fusion encoding module. Specifically, by developing a uniform sampling strategy, our method can sample from 2D image features and 3D voxel features simultaneously, which makes it flexible and avoids an explicit transformation to the BEV space during feature concatenation. We further implement a residual structure in our feature encoder to keep the model robust when an input modality is missing. In extensive experiments on nuScenes, a popular autonomous driving benchmark dataset, our method achieves state-of-the-art single-model performance of 72.6% mean average precision (mAP) and 75.1% nuScenes detection score (NDS) in the 3D object detection task without test-time augmentation.
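The two ideas in the abstract, sampling image and voxel features at the same 3D reference points and adding them residually so that a missing modality leaves the query intact, can be illustrated with a minimal NumPy sketch. This is not FusionFormer's implementation: the learned deformable attention (offsets and attention weights) is replaced here by a nearest-neighbour lookup and plain addition, and the names `sample_voxel`, `sample_image`, `fuse`, `proj`, and `pc_range` are hypothetical.

```python
import numpy as np


def sample_voxel(voxel_feats, points, pc_range):
    """Nearest-neighbour lookup of 3D points in an (X, Y, Z, C) voxel grid."""
    lo = np.asarray(pc_range[:3], dtype=float)
    hi = np.asarray(pc_range[3:], dtype=float)
    shape = np.array(voxel_feats.shape[:3])
    # Map metric coordinates to integer voxel indices, clamped to the grid.
    idx = np.floor((points - lo) / (hi - lo) * shape).astype(int)
    idx = np.clip(idx, 0, shape - 1)
    return voxel_feats[idx[:, 0], idx[:, 1], idx[:, 2]]


def sample_image(img_feats, points, proj):
    """Project 3D points with a 3x4 matrix and read the nearest pixel
    of an (H, W, C) feature map; points outside the image yield zeros."""
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    uvw = homo @ proj.T
    depth = uvw[:, 2]
    valid = depth > 1e-6
    uv = uvw[:, :2] / np.where(valid, depth, 1.0)[:, None]
    h, w = img_feats.shape[:2]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid &= (u >= 0) & (u < w) & (v >= 0) & (v < h)
    out = np.zeros((len(points), img_feats.shape[2]))
    out[valid] = img_feats[v[valid], u[valid]]
    return out


def fuse(queries, points, voxel_feats=None, img_feats=None, proj=None,
         pc_range=(-4.0, -4.0, -2.0, 4.0, 4.0, 2.0)):
    """Residual fusion: each available modality adds its sampled features
    to the queries; a missing modality is skipped, so the queries pass
    through that branch unchanged."""
    out = queries.copy()
    if voxel_feats is not None:
        out = out + sample_voxel(voxel_feats, points, pc_range)
    if img_feats is not None:
        out = out + sample_image(img_feats, points, proj)
    return out
```

Because the reference points carry their own heights, both lookups use the full 3D coordinate and neither modality has to be flattened into a BEV grid first. Calling `fuse(queries, points, voxel_feats=vox)` without `img_feats` mimics a camera dropout: the output is still well defined and simply lacks the image contribution.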