Multi-sensor fusion is essential for accurate 3D object detection in self-driving systems. Camera and LiDAR are the most commonly used sensors, and their fusion usually happens at the early or late stages of a 3D detector with the help of regions of interest (RoIs). Fusion at the intermediate level is more adaptive because it does not need RoIs from either modality, but it is harder to realize because the features of the two modalities are represented from different points of view. In this paper, we propose a new intermediate-level multi-modal fusion approach (mmFUSION) to overcome these challenges. First, mmFUSION uses a separate encoder for each modality to compute features at a desired, smaller spatial volume. Second, these features are fused through the cross-modality and multi-modality attention mechanisms proposed in mmFUSION. The framework preserves multi-modal information and learns to compensate for each modality's deficiencies through attention weights. The resulting multi-modal features are fed to a simple 3D detection head for 3D predictions. We evaluate mmFUSION on the KITTI and NuScenes datasets, where it performs better than available early, intermediate, late, and even two-stage fusion schemes. The code, with an mmdetection3D project plugin, will be made publicly available soon.
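To make the idea of attention-based fusion concrete, the following is a minimal sketch of generic scaled dot-product cross-attention between two sets of feature tokens. It is not the mmFUSION implementation: the abstract does not specify the attention layers, so the function names (`cross_attention`, `fuse`), the absence of learned projections, the concatenation step, and the assumption that both encoders emit the same number of tokens with the same feature size are all illustrative choices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Each query token attends over all context tokens.

    Assumption: no learned projections, so the context tokens serve
    as both keys and values (plain scaled dot-product attention).
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        # Similarity of this query to every context token, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in context
        ]
        weights = softmax(scores)
        # Weighted sum of the context tokens.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, context)) for j in range(dim)]
        )
    return attended


def fuse(lidar_tokens, camera_tokens):
    """Hypothetical fusion: cross-attend in both directions, then
    concatenate the two results token by token.

    Assumes both modalities were encoded into the same number of tokens.
    """
    lidar_from_camera = cross_attention(lidar_tokens, camera_tokens)
    camera_from_lidar = cross_attention(camera_tokens, lidar_tokens)
    return [a + b for a, b in zip(lidar_from_camera, camera_from_lidar)]


# Toy features: two tokens per modality, two channels per token.
lidar = [[1.0, 0.0], [0.0, 1.0]]
camera = [[0.5, 0.5], [1.0, -1.0]]
fused = fuse(lidar, camera)
```

Because the attention weights for each query sum to one, every attended token is a convex combination of the other modality's tokens. This is one simple way in which one modality's features could fill in where the other is weak.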