Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones. However, in 3D point cloud pre-training with ViTs, masked autoencoder (MAE) modeling remains dominant. This raises a natural question: can we take the best of both worlds? To answer this question, we first empirically validate that integrating MAE-based point cloud pre-training with the standard contrastive learning paradigm, even with meticulous design, can lead to a decrease in performance. To address this limitation, we reintroduce CL into the MAE-based point cloud pre-training paradigm by leveraging the inherent contrastive properties of MAE. Specifically, rather than relying on the extensive data augmentation commonly used in the image domain, we randomly mask the input tokens twice to generate a contrastive input pair. Subsequently, a weight-sharing encoder and two identically structured decoders are used to perform masked token reconstruction. Additionally, we require that, for an input token masked by both masks simultaneously, the two reconstructed features be as similar as possible. This naturally establishes an explicit contrastive constraint within the generative MAE-based pre-training paradigm, resulting in our proposed method, Point-CMAE. Consequently, Point-CMAE effectively enhances representation quality and transfer performance compared to its MAE counterpart. Experimental evaluations across various downstream applications, including classification, part segmentation, and few-shot learning, demonstrate the efficacy of our framework in surpassing state-of-the-art techniques under standard ViTs and single-modal settings. The source code and trained models are available at: https://github.com/Amazingren/Point-CMAE.
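The double-masking mechanism described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: the encoder/decoder modules, the feature dimensions, the mask ratio, and the cosine-based agreement loss on doubly-masked tokens are all simplified placeholders.

```python
# Minimal sketch of Point-CMAE's double-masking contrastive constraint.
# All module choices (linear encoder/decoders, cosine agreement loss) are
# illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


def random_mask(num_tokens: int, mask_ratio: float) -> torch.Tensor:
    """Return a boolean mask over tokens: True means the token is masked out."""
    num_masked = int(num_tokens * mask_ratio)
    perm = torch.randperm(num_tokens)
    mask = torch.zeros(num_tokens, dtype=torch.bool)
    mask[perm[:num_masked]] = True
    return mask


class PointCMAESketch(nn.Module):
    def __init__(self, dim: int = 32):
        super().__init__()
        # One weight-sharing encoder and two identically structured decoders,
        # as in the abstract (stand-ins for the real ViT encoder/decoders).
        self.encoder = nn.Linear(dim, dim)
        self.decoder_a = nn.Linear(dim, dim)
        self.decoder_b = nn.Linear(dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(dim))

    def forward_branch(self, tokens, mask, decoder):
        # Replace masked tokens with a learnable mask token, then encode/decode.
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        return decoder(self.encoder(x))

    def forward(self, tokens, mask_ratio: float = 0.6):
        n = tokens.shape[0]
        # Mask the same input twice to form the contrastive pair.
        mask_a = random_mask(n, mask_ratio)
        mask_b = random_mask(n, mask_ratio)
        feat_a = self.forward_branch(tokens, mask_a, self.decoder_a)
        feat_b = self.forward_branch(tokens, mask_b, self.decoder_b)
        # Tokens masked by BOTH masks: their reconstructed features should agree.
        both = mask_a & mask_b
        if both.any():
            agreement = F.cosine_similarity(feat_a[both], feat_b[both], dim=-1)
            return 1.0 - agreement.mean()  # 0 when the two branches agree fully
        return tokens.new_zeros(())
```

In a full pipeline this agreement term would be added to the usual MAE point-reconstruction loss; the sketch isolates only the explicit contrastive constraint that distinguishes Point-CMAE from its plain MAE counterpart.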