The primary requirement for cross-modal data fusion is the precise alignment of data from different sensors. However, calibration between LiDAR point clouds and camera images is typically time-consuming and requires an external calibration board or specific environmental features. Cross-modal registration effectively solves this problem by aligning the data directly, without external calibration. However, due to the domain gap between point clouds and images, existing methods rarely achieve satisfactory registration accuracy while maintaining real-time performance. To address this issue, we propose a framework that projects point clouds into several 2D representations for matching with camera images, which not only leverages the geometric characteristics of LiDAR point clouds more effectively but also bridges the domain gap between point clouds and images. Moreover, to tackle the challenges of cross-modal differences and the limited overlap between LiDAR point clouds and images in the image-matching task, we introduce a multi-scale feature extraction network that effectively extracts features from both camera images and the projection maps of the LiDAR point cloud. Additionally, we propose a patch-to-pixel matching network that provides more effective supervision and achieves higher accuracy. We validate the performance of our model through experiments on the KITTI and nuScenes datasets. Our network achieves real-time performance and high registration accuracy: on the KITTI dataset, our model achieves a registration accuracy rate of over 99\%.
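The projection of a LiDAR point cloud into a 2D representation can be illustrated with a minimal sketch. The snippet below builds a range image via spherical projection, a common choice for such 2D representations; the resolution, vertical field of view (mimicking a 64-beam sensor), and all function names are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def spherical_projection(points, h=64, w=1024, fov_up=3.0, fov_down=-25.0):
    """Project LiDAR points of shape (N, 3) onto an (h, w) range image.

    Illustrative sketch: sensor parameters are assumed, not taken
    from the paper.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)  # azimuth angle in [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(depth, 1e-8), -1.0, 1.0))

    fov_up_rad = np.radians(fov_up)
    fov_rad = np.radians(fov_up - fov_down)

    # Map azimuth to columns and elevation to rows of the image.
    u = 0.5 * (1.0 - yaw / np.pi) * w
    v = (fov_up_rad - pitch) / fov_rad * h
    u = np.clip(np.floor(u), 0, w - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, h - 1).astype(np.int32)

    # Write far points first so the nearest return wins per pixel.
    order = np.argsort(-depth)
    range_image = np.zeros((h, w), dtype=np.float32)
    range_image[v[order], u[order]] = depth[order]
    return range_image
```

Unlike raw 3D points, this image-like representation can be fed to the same kind of 2D feature extractor used for camera images, which is what makes cross-modal matching in 2D possible.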