LiDAR point clouds have become the most common data source in autonomous driving. However, because point clouds are sparse, accurate and reliable detection cannot be achieved in certain scenarios. Images are complementary to point clouds and are therefore receiving increasing attention. Although existing fusion methods have had some success, they either perform hard fusion or do not fuse the modalities directly. In this paper, we propose MMFusion, a generic 3D detection framework that uses multi-modal features. The framework aims to fuse LiDAR and image data accurately in order to improve 3D detection in complex scenes. It consists of two separate streams, a LiDAR stream and a camera stream, each of which is compatible with any single-modal feature extraction network. The Voxel Local Perception Module in the LiDAR stream enhances local feature representation, and the Multi-modal Feature Fusion Module then selectively combines the feature outputs of the two streams to achieve better fusion. Extensive experiments show that our framework not only outperforms existing baseline methods but also improves their detection performance, especially for cyclists and pedestrians on the KITTI benchmark, while demonstrating strong robustness and generalization. We hope our work will stimulate more research into multi-modal fusion for autonomous driving tasks.
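To make the idea of selectively combining features from two streams concrete, the following is a minimal sketch of one common form of soft fusion: a per-channel sigmoid gate that blends a LiDAR feature vector with an image feature vector. It is an illustrative assumption, not the Multi-modal Feature Fusion Module itself, whose internals the abstract does not specify. The function name `gated_fusion` and the parameters `w_lidar`, `w_image`, and `bias` are hypothetical; in a real network the gate parameters would be learned.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    lidar_feat: List[float],
    image_feat: List[float],
    w_lidar: List[float],
    w_image: List[float],
    bias: List[float],
) -> List[float]:
    """Blend two equal-length feature vectors with a per-channel gate.

    Hypothetical illustration only (not the paper's module):
      gate_i  = sigmoid(w_lidar[i] * lidar_i + w_image[i] * image_i + bias[i])
      fused_i = gate_i * lidar_i + (1 - gate_i) * image_i
    A gate near 1 favours the LiDAR channel; a gate near 0 favours the image.
    """
    n = len(lidar_feat)
    if not all(len(v) == n for v in (image_feat, w_lidar, w_image, bias)):
        raise ValueError("all inputs must have the same length")

    fused = []
    for l, c, wl, wc, b in zip(lidar_feat, image_feat, w_lidar, w_image, bias):
        gate = sigmoid(wl * l + wc * c + b)
        fused.append(gate * l + (1.0 - gate) * c)
    return fused


# With all gate parameters at zero the gate is exactly 0.5,
# so the result is the plain average of the two inputs.
zeros = [0.0, 0.0]
averaged = gated_fusion([1.0, 0.0], [0.0, 2.0], zeros, zeros, zeros)
```

In contrast to hard fusion, where one modality's features are attached to the other by a fixed rule, a soft gate of this kind lets the blend vary per channel depending on the input, which is one way "selective" combination can be realized.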