3D object detection is an important task that has been widely applied in autonomous driving. Recently, fusing multi-modal inputs, i.e., LiDAR and camera data, to perform this task has become a new trend. Existing methods, however, either ignore the sparsity of LiDAR features or, due to the modality gap, fail to simultaneously preserve the original spatial structure of LiDAR features and the semantic density of camera features. To address these issues, this letter proposes a novel bidirectional complementary LiDAR-camera fusion framework, called BiCo-Fusion, that achieves robust semantic- and spatial-aware 3D object detection. The key insight is to mutually fuse the multi-modal features, enhancing the semantics of LiDAR features and the spatial awareness of camera features, and then to adaptively select features from both modalities to build a unified 3D representation. Specifically, we introduce Pre-Fusion, which consists of a Voxel Enhancement Module (VEM) that enhances the semantics of voxel features using 2D camera features and an Image Enhancement Module (IEM) that enhances the spatial characteristics of camera features using 3D voxel features. VEM and IEM are updated bidirectionally to effectively reduce the modality gap. We then introduce Unified Fusion, which adaptively weights and selects features from the enhanced LiDAR and camera features to build a unified 3D representation. Extensive experiments demonstrate the superiority of BiCo-Fusion over prior art. Project page: https://t-ys.github.io/BiCo-Fusion/.
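The adaptive weighting in Unified Fusion can be illustrated with a minimal sketch. The abstract does not specify the weighting mechanism, so everything below is an assumption made for illustration: a single sigmoid gate computed from the concatenated features of one cell, and the hypothetical names `adaptive_fuse`, `gate_weights`, and `gate_bias`. It is not the BiCo-Fusion implementation.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping a real score to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_fuse(
    lidar_feat: Sequence[float],
    camera_feat: Sequence[float],
    gate_weights: Sequence[float],
    gate_bias: float = 0.0,
) -> List[float]:
    """Fuse two feature vectors for one cell with a learned scalar gate.

    Assumption (not from the paper): the gate is a linear score over the
    concatenated features, squashed by a sigmoid. In a trained model,
    gate_weights and gate_bias would be learned parameters.
    """
    if len(lidar_feat) != len(camera_feat):
        raise ValueError("LiDAR and camera features must have the same length")
    concat = list(lidar_feat) + list(camera_feat)
    if len(gate_weights) != len(concat):
        raise ValueError("gate_weights must match the concatenated feature length")

    # Scalar weight alpha in (0, 1): how much to trust the LiDAR feature.
    score = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
    alpha = sigmoid(score)

    # Convex combination of the two modalities, applied channel-wise.
    return [alpha * l + (1.0 - alpha) * c for l, c in zip(lidar_feat, camera_feat)]


if __name__ == "__main__":
    # With an all-zero gate the weight is 0.5, so the result is a plain average.
    print(adaptive_fuse([1.0, 3.0], [3.0, 5.0], [0.0, 0.0, 0.0, 0.0]))
```

With a gate that depends on the input features, the weight varies from cell to cell, which is one simple way to read "adaptively weight to select features"; a real model would typically compute such weights with convolutional layers over the whole feature map.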