This paper addresses a fundamental question in point cloud self-supervised learning: what signal should we leverage to learn features from point clouds without annotations? To answer it, we introduce a point cloud representation learning framework based on geometric feature reconstruction. In contrast to recent work that directly adopts the masked autoencoder (MAE) and only predicts original coordinates or occupancy from masked point clouds, our method revisits the differences between images and point clouds and identifies three self-supervised learning objectives specific to point clouds: centroid prediction, normal estimation, and curvature prediction. Combined with occupancy prediction, these four objectives yield a nontrivial self-supervised learning task and together help the model reason about the fine-grained geometry of point clouds. Our pipeline is conceptually simple and consists of two major steps: it first randomly masks out groups of points and passes the masked point cloud through a Transformer-based point cloud encoder; a lightweight Transformer decoder then predicts the centroid, normal, and curvature for the points in each voxel. We transfer the pre-trained Transformer encoder to a downstream perception model. On the nuScenes dataset, our model achieves a 3.38 mAP improvement for object detection, a 2.1 mIoU gain for segmentation, and a 1.7 AMOTA gain for multi-object tracking. We also conduct experiments on the Waymo Open Dataset and achieve significant performance improvements over the baselines there as well.
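To make the reconstruction targets concrete, the following is a minimal sketch of how per-voxel centroid, normal, and curvature targets could be computed, together with a random voxel mask. The abstract does not specify how these quantities are obtained, so everything here is an assumption: the normal is taken as the eigenvector of the local covariance matrix with the smallest eigenvalue, curvature is approximated by the surface variation λ0 / (λ0 + λ1 + λ2), and the function names, `voxel_size`, `min_points`, and `mask_ratio` are hypothetical.

```python
import numpy as np


def random_voxel_mask(num_voxels, mask_ratio=0.7, seed=None):
    """Return a boolean array that is True for masked voxels (ratio is an assumed value)."""
    rng = np.random.default_rng(seed)
    num_masked = int(round(num_voxels * mask_ratio))
    masked_idx = rng.choice(num_voxels, size=num_masked, replace=False)
    mask = np.zeros(num_voxels, dtype=bool)
    mask[masked_idx] = True
    return mask


def voxel_geometry_targets(points, voxel_size=0.5, min_points=3):
    """Group points into voxels and compute centroid, normal, and curvature per voxel.

    Assumption: normal and curvature come from a PCA of the points in each voxel.
    Voxels with fewer than `min_points` points only get a centroid.
    """
    points = np.asarray(points, dtype=np.float64)
    voxel_ids = np.floor(points / voxel_size).astype(np.int64)
    keys, inverse = np.unique(voxel_ids, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # keep a flat index across NumPy versions

    targets = {}
    for k, key in enumerate(keys):
        pts = points[inverse == k]
        centroid = pts.mean(axis=0)
        normal, curvature = None, None
        if len(pts) >= min_points:
            centered = pts - centroid
            cov = centered.T @ centered / len(pts)
            # eigh returns eigenvalues in ascending order for symmetric matrices
            eigvals, eigvecs = np.linalg.eigh(cov)
            eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negatives
            normal = eigvecs[:, 0]  # direction of least variance; sign is ambiguous
            total = eigvals.sum()
            curvature = float(eigvals[0] / total) if total > 0 else 0.0
        targets[tuple(int(v) for v in key)] = {
            "centroid": centroid,
            "normal": normal,
            "curvature": curvature,
        }
    return targets
```

Under these assumptions, a voxel whose points lie on a flat surface gets a curvature target close to zero and a normal perpendicular to that surface, while occupancy corresponds simply to whether a voxel key appears in the returned dictionary.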