Masked autoencoders have been widely explored in point cloud self-supervised learning, whereby the point cloud is generally divided into visible and masked parts. These methods typically include an encoder that accepts visible patches (normalized) and the corresponding patch centers (positions) as input, while the decoder takes the encoder's output together with the centers (positions) of the masked parts and reconstructs each point in the masked patches. The pre-trained encoder is then used for downstream tasks. In this paper, we present a motivating empirical result: when the centers of masked patches are fed directly to the decoder without any information from the encoder, the decoder still reconstructs well. In other words, the patch centers are important, and the reconstruction objective does not necessarily rely on the encoder's representations, which prevents the encoder from learning semantic ones. Based on this key observation, we propose a simple yet effective method, i.e., learning to Predict Centers for Point Masked AutoEncoders (PCP-MAE), which guides the model to predict these significant centers and uses the predicted centers in place of the directly provided ones. Specifically, we propose a Predicting Center Module (PCM) that shares parameters with the original encoder and uses extra cross-attention to predict centers. Our method has high pre-training efficiency compared to other alternatives and achieves large improvements over Point-MAE, surpassing it by 5.50% on OBJ-BG, 6.03% on OBJ-ONLY, and 5.17% on PB-T50-RS for 3D object classification on the ScanObjectNN dataset. The code is available at https://github.com/aHapBean/PCP-MAE.
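The predict-centers idea can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: a single cross-attention layer stands in for the PCM (which in the paper shares the encoder's parameters), and all names such as `mask_queries` and `w_out`, as well as the token dimensions, are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: queries attend to keys/values."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

rng = np.random.default_rng(0)
dim = 32
n_visible, n_masked = 26, 38  # arbitrary patch counts for the sketch

# Hypothetical stand-ins: encoder output for visible patches, and learnable
# queries for the masked positions; the projection maps features to 3D centers.
visible_tokens = rng.normal(size=(n_visible, dim))
mask_queries = rng.normal(size=(n_masked, dim))
w_out = rng.normal(size=(dim, 3)) * 0.02

# Predict the (x, y, z) centers of masked patches from visible information,
# instead of feeding the ground-truth centers to the decoder directly.
pred_feats = cross_attention(mask_queries, visible_tokens, visible_tokens)
pred_centers = pred_feats @ w_out  # shape: (n_masked, 3)
print(pred_centers.shape)
```

During pre-training, `pred_centers` would be regressed toward the true masked-patch centers, and the predicted (rather than ground-truth) centers would be passed to the decoder, forcing the encoder to produce semantically useful representations.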