Transformer-based self-supervised representation learning methods learn generic features from unlabeled datasets, providing useful network initialization parameters for downstream tasks. However, self-supervised learning based on masking local surface patches remains under-explored for 3D point cloud data. In this paper, we propose Masked Autoencoders for 3D point cloud representation learning (abbreviated as MAE3D), a novel autoencoding paradigm for self-supervised learning. We first split the input point cloud into patches and mask a portion of them, then use our Patch Embedding Module to extract features from the unmasked patches. Next, we employ patch-wise MAE3D Transformers to learn both the local features of point cloud patches and the high-level contextual relationships between patches, and to complete the latent representations of the masked patches. Finally, our Point Cloud Reconstruction Module, with a multi-task loss, completes the incomplete point cloud. We conduct self-supervised pre-training on ShapeNet55 with point cloud completion as the pretext task and fine-tune the pre-trained model on ModelNet40 and ScanObjectNN (PB\_T50\_RS, the hardest variant). Comprehensive experiments demonstrate that the local features extracted by MAE3D from point cloud patches are beneficial for downstream classification, outperforming state-of-the-art methods with $93.4\%$ classification accuracy on ModelNet40 and $86.2\%$ on ScanObjectNN.
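The first step of the pipeline, splitting a point cloud into patches and masking a portion of them, might look roughly like the sketch below. The abstract does not say how patches are formed or what fraction is masked, so everything specific here is an assumption made for illustration: farthest point sampling for patch centers, k-nearest-neighbor grouping, uniform random masking, and the hypothetical names and defaults `split_and_mask`, `num_patches`, `patch_size`, and `mask_ratio`. It should not be read as the MAE3D implementation.

```python
import numpy as np

def farthest_point_sample(points, num_centers, rng):
    """Return indices of num_centers points spread out over the cloud."""
    n = points.shape[0]
    chosen = np.empty(num_centers, dtype=np.int64)
    chosen[0] = rng.integers(n)          # random starting point
    dist = np.full(n, np.inf)            # squared distance to nearest chosen point
    for i in range(1, num_centers):
        diff = points - points[chosen[i - 1]]
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))
        chosen[i] = int(np.argmax(dist)) # farthest from everything chosen so far
    return chosen

def split_and_mask(points, num_patches=64, patch_size=32, mask_ratio=0.7, seed=0):
    """Group an (N, 3) cloud into local patches and pick some to mask.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    centers = points[farthest_point_sample(points, num_patches, rng)]
    # Squared distance from every center to every point: (num_patches, N)
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    knn_idx = np.argsort(d2, axis=1)[:, :patch_size]
    # Express each patch in coordinates relative to its own center
    patches = points[knn_idx] - centers[:, None, :]
    # Uniform random choice of which patches are hidden from the encoder
    num_masked = int(round(mask_ratio * num_patches))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return patches, centers, mask

cloud = np.random.default_rng(1).standard_normal((1024, 3))
patches, centers, mask = split_and_mask(cloud)
visible = patches[~mask]   # what an embedding module would see
hidden = patches[mask]     # what a reconstruction module would have to recover
```

In this sketch only `visible` would be passed to a patch embedding step, while `hidden` together with `centers` would serve as the reconstruction target for the completion task.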