We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds. The core idea is to train the model on a pretext task, namely reconstructing the surface on which the 3D points are sampled, and to use the underlying latent vectors as input to the perception head. The intuition is that if the network can reconstruct the scene surface given only sparse input points, then it probably also captures some fragments of semantic information, which can be used to boost an actual perception task. This principle has a very simple formulation, which makes it both easy to implement and applicable to a wide range of 3D sensors and deep networks performing semantic segmentation or object detection. Moreover, it supports a single-stream pipeline, as opposed to most contrastive learning approaches, which allows training on limited resources. We conducted extensive experiments on various autonomous driving datasets, involving very different kinds of lidars, for both semantic segmentation and object detection. The results show the effectiveness of our method in learning useful representations without any annotation, compared to existing approaches. Code is available at https://github.com/valeoai/ALSO
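To make the pretext task concrete, the following is a minimal sketch of how surface-reconstruction supervision could be derived from raw lidar points without annotation. It rests on assumptions not stated in the abstract: occupancy is used as the surface representation, a query point placed slightly in front of each return along the sensor ray is labeled empty, and one placed slightly behind it is labeled occupied. The names `make_occupancy_queries`, `binary_cross_entropy`, `sensor_origin` and `delta` are hypothetical and are not taken from the released code; the backbone and decoder are replaced by placeholder logits.

```python
import math


def make_occupancy_queries(points, sensor_origin=(0.0, 0.0, 0.0), delta=0.1):
    """Return (query_xyz, occupancy_label) pairs derived from raw lidar points.

    Assumption: the space just in front of a lidar return (towards the sensor)
    is empty, and the space just behind it lies inside the observed object.
    """
    queries = []
    for p in points:
        ray = [pi - oi for pi, oi in zip(p, sensor_origin)]
        norm = math.sqrt(sum(r * r for r in ray))
        if norm == 0.0:
            continue  # point coincides with the sensor: no ray direction
        unit = [r / norm for r in ray]
        front = tuple(pi - delta * ui for pi, ui in zip(p, unit))
        behind = tuple(pi + delta * ui for pi, ui in zip(p, unit))
        queries.append((front, 0))   # labeled empty
        queries.append((behind, 1))  # labeled occupied
    return queries


def binary_cross_entropy(logits, labels):
    """Mean binary cross-entropy between occupancy logits and 0/1 labels."""
    total = 0.0
    for z, y in zip(logits, labels):
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(labels)


# Usage: two lidar returns seen from a sensor at the origin.
points = [(10.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
queries = make_occupancy_queries(points, delta=0.5)
labels = [label for _, label in queries]

# Placeholder for the output of backbone + occupancy decoder at each query.
logits = [0.0] * len(labels)
loss = binary_cross_entropy(logits, labels)
```

In an actual pre-training loop, the logits would come from a decoder that reads the backbone's latent vectors near each query point, and the loss would be back-propagated through the backbone. Since the labels follow from sensor geometry alone, a single forward pass per scan is enough, which is in line with the single-stream pipeline described above.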