Multi-sensor, multi-modal fusion has shown strong advantages in 3D object detection. However, existing methods that fuse multi-modal features require transforming the features into bird's-eye-view (BEV) space, which may lose information along the Z-axis and thus degrade performance. To this end, we propose FusionFormer, a novel end-to-end multi-modal fusion framework based on transformers that incorporates deformable attention and residual structures within its fusion encoding module. Specifically, by developing a uniform sampling strategy, our method can sample from both 2D image features and 3D voxel features, which keeps it flexible and avoids an explicit transformation to BEV space during feature concatenation. We further implement a residual structure in the feature encoder to keep the model robust when an input modality is missing. In extensive experiments on nuScenes, a popular autonomous driving benchmark, our method achieves state-of-the-art single-model performance of 72.6% mAP and 75.1% NDS on the 3D object detection task without test-time augmentation.
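To make the sampling and residual ideas concrete, the following is a minimal sketch and not the authors' implementation. It replaces learned deformable attention (offsets and attention weights) with a single nearest-neighbour lookup per modality and a plain mean. All names (`sample_image`, `sample_voxel`, `fuse_point`), the pinhole camera parameters, and the assumption that the camera and voxel grid share one coordinate frame are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Optional, Tuple

Vector = List[float]
Point3D = Tuple[float, float, float]


def sample_image(image_feat, point: Point3D,
                 focal: float, cx: float, cy: float) -> Optional[Vector]:
    """Project a 3D point with a pinhole model and read the nearest pixel feature.

    image_feat is an H x W x C nested list, or None if the camera is missing.
    """
    if image_feat is None:
        return None  # camera modality missing
    x, y, z = point
    if z <= 0:
        return None  # point lies behind the (assumed) camera
    u = math.floor(focal * x / z + cx + 0.5)  # nearest pixel column
    v = math.floor(focal * y / z + cy + 0.5)  # nearest pixel row
    if 0 <= v < len(image_feat) and 0 <= u < len(image_feat[0]):
        return image_feat[v][u]
    return None  # projection falls outside the feature map


def sample_voxel(voxel_feat, point: Point3D,
                 voxel_size: float) -> Optional[Vector]:
    """Read the feature of the voxel containing the point.

    voxel_feat is an X x Y x Z x C nested list whose grid starts at the
    origin, or None if the LiDAR modality is missing.
    """
    if voxel_feat is None:
        return None  # LiDAR modality missing
    i, j, k = (math.floor(c / voxel_size) for c in point)
    if (0 <= i < len(voxel_feat)
            and 0 <= j < len(voxel_feat[0])
            and 0 <= k < len(voxel_feat[0][0])):
        return voxel_feat[i][j][k]  # height index k is kept, not flattened
    return None


def fuse_point(query: Vector, point: Point3D, image_feat, voxel_feat,
               focal: float = 2.0, cx: float = 1.0, cy: float = 1.0,
               voxel_size: float = 1.0) -> Vector:
    """Residual update: query + mean of whichever modality samples exist."""
    candidates = (
        sample_image(image_feat, point, focal, cx, cy),
        sample_voxel(voxel_feat, point, voxel_size),
    )
    samples = [s for s in candidates if s is not None]
    if not samples:
        return list(query)  # residual path alone: the query passes through
    mean = [sum(vals) / len(samples) for vals in zip(*samples)]
    return [q + m for q, m in zip(query, mean)]
```

In this sketch the same 3D point indexes both feature sources, so the voxel lookup keeps its height index rather than collapsing to a BEV grid, and the additive update returns the query unchanged when no modality contributes.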