Lidars and cameras are critical sensors that provide complementary information for 3D detection in autonomous driving. Most prevalent methods progressively downscale the 3D point clouds and camera images and then fuse the high-level features, so the downscaled features inevitably lose low-level detail. In this paper, we propose Fine-Grained Lidar-Camera Fusion (FGFusion), which makes full use of the multi-scale features of the image and the point cloud and fuses them in a fine-grained way. First, we design a dual-pathway hierarchical structure to extract both high-level semantic and low-level detailed features of the image. Second, we introduce an auxiliary network that guides the point cloud features to better learn fine-grained spatial information. Finally, we propose multi-scale fusion (MSF) to fuse the last N feature maps of the image and the point cloud. Extensive experiments on two popular autonomous driving benchmarks, i.e., KITTI and Waymo, demonstrate the effectiveness of our method.
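To make the idea of fusing the last N feature maps concrete, the following is a minimal, dependency-free sketch. It is not the FGFusion implementation: the abstract does not specify the fusion operator, so channel-wise concatenation is assumed here purely for illustration, and the names `make_map`, `shape`, and `fuse_last_n` are hypothetical. Feature maps are represented as nested lists of shape (channels, height, width), and maps at the same scale are assumed to be spatially aligned already.

```python
def make_map(channels, height, width, value=0.0):
    """Build a toy feature map as a nested list of shape (C, H, W)."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


def shape(fmap):
    """Return the (C, H, W) shape of a nested-list feature map."""
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))


def fuse_last_n(image_feats, lidar_feats, n):
    """Fuse the last n scales of two feature pyramids, scale by scale.

    Assumption: fusion is plain channel-wise concatenation, and the image
    and point cloud maps at each scale already share the same H and W.
    """
    if not 1 <= n <= min(len(image_feats), len(lidar_feats)):
        raise ValueError("n must be between 1 and the number of scales")
    fused = []
    for img, pts in zip(image_feats[-n:], lidar_feats[-n:]):
        if shape(img)[1:] != shape(pts)[1:]:
            raise ValueError("maps at the same scale must share H and W")
        # Concatenating the channel lists stacks the two maps along C.
        fused.append(img + pts)
    return fused


# Toy three-level pyramids, ordered from fine to coarse.
image_feats = [make_map(8, 32, 32), make_map(16, 16, 16), make_map(32, 8, 8)]
lidar_feats = [make_map(4, 32, 32), make_map(8, 16, 16), make_map(16, 8, 8)]

fused = fuse_last_n(image_feats, lidar_feats, n=2)
print([shape(f) for f in fused])  # [(24, 16, 16), (48, 8, 8)]
```

With N = 2, only the two coarsest scales are fused, while a single-scale scheme corresponds to N = 1. In practice the concatenation would be replaced by whatever learned fusion operator the method defines.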