Multi-modal fusion, especially between camera and LiDAR, has become essential for building accurate and robust 3D object detection systems for autonomous vehicles. Until recently, point decorating approaches, in which point clouds are augmented with camera features, have been dominant in the field. However, these approaches fail to utilize the higher-resolution images from cameras. Recent works instead project camera features to the bird's-eye-view (BEV) space for fusion, but they require projecting millions of pixels, most of which contain only background information. In this work, we propose a novel approach, Center Feature Fusion (CFF), in which we leverage center-based detection networks in both the camera and LiDAR streams to identify relevant object locations. We then use the center-based detections to select the pixel features at those object locations, which make up a small fraction of the total number in the image. These features are then projected and fused in the BEV frame. On the nuScenes dataset, we outperform the LiDAR-only baseline by 4.9% mAP while fusing up to 100x fewer features than other fusion methods.
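The idea of fusing only center-relevant pixel features can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the score threshold, the availability of a per-pixel depth map, the BEV grid layout, and the use of channel concatenation as the fusion operator are all assumptions made for illustration.

```python
import numpy as np

def select_center_pixels(heatmap, threshold=0.5):
    """Return (rows, cols) of pixels whose center score exceeds a threshold.

    `heatmap` is a hypothetical (H, W) center-score map from the camera stream.
    """
    rows, cols = np.nonzero(heatmap > threshold)
    return rows, cols

def project_to_bev(features, depth, rows, cols, K, bev_range=50.0, bev_size=100):
    """Scatter the selected pixel features into a BEV grid.

    Assumptions: `features` is (C, H, W), `depth` is (H, W) in meters,
    `K` is a 3x3 pinhole intrinsics matrix, and the BEV grid covers
    x in [-bev_range, bev_range) and z in [0, 2 * bev_range) in the camera frame.
    """
    num_channels = features.shape[0]
    cell = 2.0 * bev_range / bev_size  # meters per BEV cell

    # Unproject the selected pixels: lateral offset x and forward distance z.
    z = depth[rows, cols]
    x = (cols - K[0, 2]) * z / K[0, 0]

    ix = np.floor((x + bev_range) / cell).astype(int)
    iz = np.floor(z / cell).astype(int)
    valid = (ix >= 0) & (ix < bev_size) & (iz >= 0) & (iz < bev_size)

    # Accumulate features per cell; np.add.at handles repeated indices.
    bev = np.zeros((bev_size, bev_size, num_channels), dtype=features.dtype)
    selected = features[:, rows[valid], cols[valid]].T  # (N, C)
    np.add.at(bev, (iz[valid], ix[valid]), selected)
    return bev.transpose(2, 0, 1)  # (C, bev_size, bev_size)

def fuse_bev(lidar_bev, camera_bev):
    """Fuse by channel concatenation (an assumed, simple fusion operator)."""
    return np.concatenate([lidar_bev, camera_bev], axis=0)
```

Only the pixels returned by `select_center_pixels` are unprojected and scattered, so the projection cost scales with the number of candidate object centers rather than with the full image resolution.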