Pre-training is crucial in 3D-related fields such as autonomous driving, where point cloud annotation is costly and challenging. Many recent studies on point cloud pre-training, however, have overlooked the issue of incompleteness: LiDAR captures only a fraction of the points in a scene, which leads to ambiguity during training. Images, by contrast, offer more comprehensive information and richer semantics that can help point cloud encoders address this incompleteness. Yet incorporating images into point cloud pre-training presents its own challenges, because occlusions can cause misalignments between points and pixels. In this work, we propose PRED, a novel occlusion-aware, image-assisted pre-training framework for outdoor point clouds. The main ingredient of our framework is semantic rendering conditioned on a Bird's-Eye-View (BEV) feature map, which leverages the semantics of images for supervision through neural rendering. We further enhance our model's performance by incorporating point-wise masking with a high mask ratio (95%). Extensive experiments demonstrate PRED's superiority over prior point cloud pre-training methods, with significant improvements on various large-scale datasets for 3D perception tasks. Code will be available at https://github.com/PRED4pc/PRED.
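To make the point-wise masking step concrete, the following is a minimal sketch of what masking with a 95% ratio could look like, assuming uniform random selection over individual points. The abstract does not specify the sampling strategy, so the function name `mask_points`, its parameters, and the uniform sampling are illustrative assumptions rather than PRED's actual implementation. Only the Python standard library is used here; a real pipeline would more likely operate on tensors.

```python
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def mask_points(
    points: Sequence[Point],
    mask_ratio: float = 0.95,
    seed: Optional[int] = None,
) -> Tuple[List[Point], List[int], List[int]]:
    """Randomly hide a fraction of points (hypothetical helper, not PRED's code).

    Returns the visible points, their indices, and the indices of masked points.
    Assumption: points are masked independently and uniformly at random.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    n = len(points)
    if n == 0:
        return [], [], []
    # Keep at least one point so the encoder always has some input.
    n_keep = max(1, round(n * (1.0 - mask_ratio)))
    rng = random.Random(seed)
    keep_idx = sorted(rng.sample(range(n), n_keep))
    keep_set = set(keep_idx)
    masked_idx = [i for i in range(n) if i not in keep_set]
    visible = [points[i] for i in keep_idx]
    return visible, keep_idx, masked_idx


# Usage: a synthetic cloud of 1000 points, of which 5% stay visible.
cloud = [(float(i), float(i) * 0.5, 0.0) for i in range(1000)]
visible, keep_idx, masked_idx = mask_points(cloud, mask_ratio=0.95, seed=0)
print(len(visible), len(masked_idx))
```

With a ratio this high, the encoder sees only a small subset of an already incomplete scan, which is consistent with the abstract's motivation for drawing supervision from images rather than from the point cloud alone.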