LiDAR-camera fusion can improve 3D object detection by exploiting the complementary information in depth-aware LiDAR points and semantically rich images. Existing voxel-based methods struggle to fuse sparse voxel features with dense image features in a one-to-one manner: the fusion discards the semantic and continuity information that images provide, which leads to sub-optimal detection performance, especially at long distances. In this paper, we present VoxelNextFusion, a multi-modal 3D object detection framework designed specifically for voxel-based methods, which effectively bridges the gap between sparse point clouds and dense images. In particular, we propose a voxel-based image pipeline that projects point clouds onto images to obtain both pixel-level and patch-level features. These features are then fused using self-attention to obtain a combined representation. Moreover, because patches also contain background features, we propose a feature importance module that distinguishes foreground from background features and thereby minimizes the impact of the background. Extensive experiments were conducted on the widely used KITTI and nuScenes 3D object detection benchmarks. Notably, VoxelNextFusion improves AP@0.7 for car detection at the hard level by around +3.20% over the Voxel R-CNN baseline on the KITTI test set.
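The projection and fusion steps described above can be illustrated with a minimal NumPy sketch. It is not the VoxelNextFusion implementation: the function names (`project_points`, `gather_pixel_and_patch`, `attend`), the toy 3 x 4 projection matrix, the 3 x 3 patch size, and the random feature map are all assumptions made for illustration. The sketch also swaps the paper's self-attention module for a single-head scaled dot-product attention with no learned weights, in which the pixel-level feature is the query and the patch-level features are the keys and values, and it omits the feature importance module entirely.

```python
import numpy as np


def project_points(points_xyz, proj):
    """Project N x 3 points with a 3 x 4 matrix (hypothetical calibration).

    Returns pixel coordinates (N x 2) and a mask of points in front of the camera.
    """
    ones = np.ones((points_xyz.shape[0], 1))
    cam = np.hstack([points_xyz, ones]) @ proj.T      # N x 3, homogeneous image coords
    depth = cam[:, 2]
    valid = depth > 1e-6                              # drop points behind the camera
    safe_depth = np.where(valid, depth, 1.0)          # avoid division by zero
    uv = cam[:, :2] / safe_depth[:, None]
    return uv, valid


def gather_pixel_and_patch(feat, uv, k=1):
    """Sample an H x W x C feature map at each projected point.

    Returns pixel-level features (N x C) and patch-level features
    (N x (2k+1)^2 x C) taken from the surrounding window, clamped at the borders.
    """
    h, w, _ = feat.shape
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, w - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, h - 1)
    pixel = feat[v, u]

    offsets = np.arange(-k, k + 1)
    dv, du = np.meshgrid(offsets, offsets, indexing="ij")
    pv = np.clip(v[:, None] + dv.ravel()[None, :], 0, h - 1)
    pu = np.clip(u[:, None] + du.ravel()[None, :], 0, w - 1)
    patch = feat[pv, pu]                              # N x P x C
    return pixel, patch


def attend(pixel, patch):
    """Scaled dot-product attention: pixel feature as query, patch as keys/values."""
    c = pixel.shape[-1]
    scores = np.einsum("nc,npc->np", pixel, patch) / np.sqrt(c)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    fused = np.einsum("np,npc->nc", weights, patch)
    return fused, weights


# Toy data: three points (one behind the camera) and a random 48 x 64 x 8 feature map.
rng = np.random.default_rng(0)
proj = np.array([[100.0, 0.0, 32.0, 0.0],
                 [0.0, 100.0, 24.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0]])
points = np.array([[0.0, 0.0, 10.0],
                   [1.0, 0.5, 20.0],
                   [0.0, 0.0, -5.0]])
feat = rng.standard_normal((48, 64, 8))

uv, valid = project_points(points, proj)
pixel, patch = gather_pixel_and_patch(feat, uv[valid], k=1)
fused, weights = attend(pixel, patch)
```

Here each valid point receives one pixel-level feature and nine patch-level features, and the attention weights decide how much each neighbouring pixel contributes to the fused feature. In a real pipeline the features would come from an image backbone and the calibration from the dataset, and a learned module would down-weight background pixels inside each patch.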