By identifying four important components of existing LiDAR-camera 3D object detection methods (LiDAR candidates, camera candidates, transformation, and fusion outputs), we observe that all existing methods either find dense candidates or yield dense representations of scenes. However, given that objects occupy only a small part of a scene, finding dense candidates and generating dense representations are noisy and inefficient. We propose SparseFusion, a novel multi-sensor 3D detection method that exclusively uses sparse candidates and sparse representations. Specifically, SparseFusion utilizes the outputs of parallel detectors in the LiDAR and camera modalities as sparse candidates for fusion. We transform the camera candidates into the LiDAR coordinate space by disentangling the object representations. We then fuse the multi-modality candidates in a unified 3D space with a lightweight self-attention module. To mitigate negative transfer between modalities, we propose novel semantic and geometric cross-modality transfer modules that are applied prior to the modality-specific detectors. SparseFusion achieves state-of-the-art performance on the nuScenes benchmark while also running at the fastest speed, even outperforming methods with stronger backbones. We perform extensive experiments to demonstrate the effectiveness and efficiency of our modules and the overall method pipeline. Our code will be made publicly available at https://github.com/yichen928/SparseFusion.
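The fusion step described above, in which sparse candidates from both modalities are combined by self-attention in one space, can be illustrated with a minimal sketch. This is not the SparseFusion implementation: it assumes each candidate is already a fixed-length feature vector, that the camera candidates have already been transformed into the LiDAR coordinate space, and that one single-head attention layer with random weights stands in for the learned module. All names (`fuse_candidates`, `rand_matrix`, `w_q`, `w_k`, `w_v`) are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over token vectors."""
    q = matmul(tokens, w_q)
    k = matmul(tokens, w_k)
    v = matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights


def fuse_candidates(lidar_feats, camera_feats, w_q, w_k, w_v):
    """Concatenate the two sparse candidate sets and let them attend to each other.

    Assumption: camera_feats are already expressed in the LiDAR coordinate space.
    """
    tokens = lidar_feats + camera_feats
    return self_attention(tokens, w_q, w_k, w_v)


def rand_matrix(rows, cols):
    """Random placeholder values; a real model would use learned features and weights."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
dim = 4
lidar_feats = rand_matrix(3, dim)   # 3 hypothetical LiDAR candidates
camera_feats = rand_matrix(2, dim)  # 2 hypothetical camera candidates
w_q, w_k, w_v = (rand_matrix(dim, dim) for _ in range(3))

fused, weights = fuse_candidates(lidar_feats, camera_feats, w_q, w_k, w_v)
print(len(fused), "fused candidates of dimension", len(fused[0]))
```

Because attention runs only over the handful of candidates rather than over every cell of a dense scene representation, the size of the attention matrix depends on the number of candidates and not on the scene resolution.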