This paper introduces a self-supervised learning framework for enhancing 3D perception in autonomous driving scenes. Our approach, named NCLR, centers on 2D-3D neural calibration, a novel pretext task that estimates the rigid transformation aligning the camera and LiDAR coordinate systems. First, we propose a learnable transformation alignment to bridge the domain gap between image and point cloud data, converting features into a unified representation space for effective comparison and matching. Second, we identify the overlapping area between the image and the point cloud using the fused features. Third, we establish dense 2D-3D correspondences to estimate the rigid transformation. The framework thus learns fine-grained point-to-pixel matching while also aligning the image and point cloud at a holistic level by recovering their relative pose. We demonstrate NCLR's efficacy by applying the pre-trained backbone to downstream tasks such as LiDAR-based 3D semantic segmentation, object detection, and panoptic segmentation. Comprehensive experiments on various datasets show that NCLR outperforms existing self-supervised methods. The results confirm that joint learning from different modalities significantly enhances the network's understanding and the effectiveness of the learned representations. Code will be available at \url{https://github.com/Eaphan/NCLR}.
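To make the geometry behind the pretext task concrete, the following is a minimal sketch of how a rigid transformation relates LiDAR points to image pixels and how the overlapping area can be read off from it. It is not the NCLR implementation: the function name `project_lidar_to_image`, the pinhole intrinsics `K`, and the toy points are all assumptions chosen for illustration, and the network described above predicts the transformation from learned features rather than receiving it as input.

```python
import numpy as np


def project_lidar_to_image(points, R, t, K, width, height):
    """Project N x 3 LiDAR points into an image (hypothetical helper).

    R (3x3) and t (3,) form the rigid transformation from the LiDAR frame
    to the camera frame; K (3x3) is an assumed pinhole intrinsic matrix.
    Returns pixel coordinates (N x 2) and a boolean overlap mask (N,).
    """
    cam = points @ R.T + t                      # LiDAR frame -> camera frame
    z = cam[:, 2]
    in_front = z > 1e-6                         # keep points in front of the camera
    safe_z = np.where(in_front, z, 1.0)         # avoid dividing by zero or negative depth
    uv = (cam @ K.T)[:, :2] / safe_z[:, None]   # perspective division
    inside = (
        (uv[:, 0] >= 0) & (uv[:, 0] < width)
        & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    )
    return uv, in_front & inside


# Toy setup (all values are illustrative assumptions).
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.zeros(3)
points = np.array([[0.0, 0.0, 10.0],    # straight ahead of the camera
                   [0.0, 0.0, -5.0],    # behind the camera
                   [10.0, 0.0, 1.0]])   # far off to the side
uv, mask = project_lidar_to_image(points, R, t, K, width=100, height=100)
```

Here only the first point falls in the overlapping area, and it projects to the image center. In a calibration-style pretext task, the point-pixel pairs selected by such a mask are the kind of dense 2D-3D correspondences from which the rigid transformation would be estimated.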