Multi-sensor fusion has demonstrated strong advantages in 3D object detection. However, existing methods that fuse multi-modal features through simple channel concatenation require transforming the features into bird's-eye-view (BEV) space and may lose information along the Z-axis, which leads to inferior performance. To this end, we propose FusionFormer, an end-to-end multi-modal fusion framework that leverages transformers to fuse multi-modal features and obtain fused BEV features. Building on FusionFormer's flexible adaptability to the input modality representation, we propose a depth prediction branch that can be added to the framework to improve detection performance in camera-based detection tasks. In addition, we propose a plug-and-play temporal fusion module based on transformers that fuses BEV features from historical frames for more stable and reliable detection results. We evaluate our method on the nuScenes dataset and achieve 72.6% mAP and 75.1% NDS for 3D object detection, outperforming state-of-the-art methods.
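To make the Z-axis argument concrete, the following toy sketch contrasts two ways of producing a BEV feature for a single grid cell. It is an illustration under stated assumptions, not the FusionFormer architecture: the helper names (`collapse_to_bev`, `add_height`, `cross_attend`), the height encoding, the single-head attention without learned projections, and the hypothetical `camera_token` are all invented for this example. Collapsing a voxel column over Z maps two columns with different height layouts to the same BEV feature, whereas a BEV query that attends over height-tagged tokens from several modalities yields distinct outputs.

```python
import math


def collapse_to_bev(column):
    """Flatten a voxel column (one feature vector per height bin) by summing over Z."""
    dim = len(column[0])
    return [sum(voxel[d] for voxel in column) for d in range(dim)]


def add_height(column):
    """Append a normalized height index to each voxel feature (an assumed encoding)."""
    top = max(len(column) - 1, 1)
    return [list(voxel) + [z / top] for z, voxel in enumerate(column)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, tokens):
    """Single-head scaled dot-product attention of one BEV query over a token list.

    Keys and values are the tokens themselves (no learned projections).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    fused = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
    return fused, weights


# Two LiDAR voxel columns: the same feature, at the bottom (A) or at the top (B).
column_a = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
column_b = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]

# Collapsing over Z cannot tell the two columns apart.
same_after_collapse = collapse_to_bev(column_a) == collapse_to_bev(column_b)

# A BEV query attends over height-tagged LiDAR tokens plus one hypothetical camera token.
bev_query = [1.0, 0.0, 0.0]
camera_token = [0.5, 0.5, 0.0]
fused_a, _ = cross_attend(bev_query, add_height(column_a) + [camera_token])
fused_b, _ = cross_attend(bev_query, add_height(column_b) + [camera_token])

print("identical after Z-collapse:", same_after_collapse)
print("identical after attention: ", fused_a == fused_b)
```

Because attention operates on a set of tokens rather than a fixed channel layout, tokens from different modalities and of different counts can be passed to the same query, which is one plausible reading of the "flexible adaptability to the input modality representation" mentioned above.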