While LiDAR sensors have been successfully applied to 3D object detection, the affordability of radar and camera sensors has led to growing interest in fusing radars and cameras for 3D object detection. However, previous radar-camera fusion models do not fully utilize radar information, because they generate initial 3D proposals from camera features alone and only afterwards perform instance-level fusion. In this paper, we propose radar-camera multi-level fusion (RCM-Fusion), which fuses the radar and camera modalities at both the feature level and the instance level to fully utilize radar information. At the feature level, we propose a Radar Guided BEV Encoder, which uses radar Bird's-Eye-View (BEV) features to transform image features into precise BEV representations and then adaptively combines the radar and camera BEV features. At the instance level, we propose a Radar Grid Point Refinement module that reduces localization error by taking the characteristics of radar point clouds into account. Experiments on the public nuScenes dataset demonstrate that the proposed RCM-Fusion achieves an 11.8% gain in nuScenes detection score (NDS) over the camera-only baseline model and state-of-the-art performance among radar-camera fusion methods on the nuScenes 3D object detection benchmark. Code will be made publicly available.
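The abstract states that radar and camera BEV features are adaptively combined but does not specify the mechanism. As an illustration only, the following minimal sketch shows one common way such an adaptive combination can be realized: a sigmoid gate, computed per BEV cell from the concatenated features, blends the two modalities. This is a generic gated fusion and is not claimed to be the Radar Guided BEV Encoder itself; the names `fuse_cell`, `fuse_bev`, `gate_weights`, and `gate_bias` are hypothetical, and the gate parameters would be learned in practice.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fuse_cell(cam_feat, radar_feat, gate_weights, gate_bias):
    """Blend the camera and radar feature vectors of a single BEV cell.

    The gate is a scalar in (0, 1) computed from the concatenated features;
    a gate near 1 favors the camera feature, a gate near 0 the radar feature.
    gate_weights has length 2 * C (hypothetical learned parameters).
    """
    concat = list(cam_feat) + list(radar_feat)
    logit = sum(w * v for w, v in zip(gate_weights, concat)) + gate_bias
    gate = sigmoid(logit)
    return [gate * c + (1.0 - gate) * r for c, r in zip(cam_feat, radar_feat)]


def fuse_bev(cam_bev, radar_bev, gate_weights, gate_bias):
    """Apply the gated blend to every cell of two H x W x C BEV grids."""
    return [
        [fuse_cell(c, r, gate_weights, gate_bias)
         for c, r in zip(cam_row, radar_row)]
        for cam_row, radar_row in zip(cam_bev, radar_bev)
    ]


if __name__ == "__main__":
    # A 1 x 1 grid with 2 channels per modality; zero gate parameters
    # give a gate of 0.5, i.e. a plain average of the two features.
    cam = [[[1.0, 2.0]]]
    radar = [[[3.0, 6.0]]]
    print(fuse_bev(cam, radar, [0.0] * 4, 0.0))
```

Because the gate depends on the features in each cell, the blend can lean on radar where radar returns are informative and on the camera elsewhere, which is the intuition behind an adaptive combination.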