Following the trend of masked image modeling led by MAE, generative pre-training has shown remarkable potential to boost the performance of fundamental models in 2D vision. In 3D vision, however, over-reliance on Transformer-based backbones and the unordered nature of point clouds have restricted the further development of generative pre-training. In this paper, we propose a novel 3D-to-2D generative pre-training method that is adaptable to any point cloud model. As the pre-training scheme, we generate view images from different instructed poses via a cross-attention mechanism. Generating view images provides more precise supervision than generating point clouds, and thus helps 3D backbones gain a finer comprehension of the geometrical structure and stereoscopic relations of the point cloud. Experimental results demonstrate the superiority of the proposed 3D-to-2D generative pre-training over previous pre-training methods. Our method is also effective in boosting the performance of architecture-oriented approaches, achieving state-of-the-art performance when fine-tuning on ScanObjectNN classification and ShapeNetPart segmentation tasks. Code is available at https://github.com/wangzy22/TAP.
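As a rough illustration of the idea of generating a view image from an instructed pose via cross-attention, the following is a minimal NumPy sketch and not the released TAP implementation. Everything in it is an assumption made for illustration: the function names (`init_params`, `pose_queries`, `generate_view`, `mse`), the way the pose is encoded into queries (patch coordinates concatenated with a flattened 3x3 rotation matrix), the single-head attention, the random matrices standing in for learned weights, and the mean squared error used as the reconstruction loss. The per-point features would come from any 3D backbone; here they are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(dim, patch, seed=0):
    # Random matrices standing in for learned weights (assumption).
    rng = np.random.default_rng(seed)
    return {
        "w_query": rng.standard_normal((11, dim)) / np.sqrt(11),
        "w_pixel": rng.standard_normal((dim, patch * patch)) / np.sqrt(dim),
    }

def pose_queries(pose, grid, w_query):
    # Hypothetical query construction: one query per image patch, built from
    # the patch centre (2 values) and the flattened 3x3 rotation (9 values).
    lin = np.linspace(-1.0, 1.0, grid)
    ys, xs = np.meshgrid(lin, lin, indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1)            # (P, 2)
    pose_flat = np.broadcast_to(pose.reshape(1, 9), (grid * grid, 9))
    return np.concatenate([coords, pose_flat], axis=1) @ w_query   # (P, dim)

def cross_attention(queries, tokens):
    # Queries come from the pose; keys and values are the point features.
    scores = queries @ tokens.T / np.sqrt(queries.shape[-1])
    attn = softmax(scores, axis=-1)                                # (P, N)
    return attn @ tokens, attn

def generate_view(point_feats, pose, params, grid=4, patch=4):
    q = pose_queries(pose, grid, params["w_query"])
    fused, attn = cross_attention(q, point_feats)
    patches = fused @ params["w_pixel"]                            # (P, patch*patch)
    # Tile the per-patch pixel blocks back into a 2D image.
    img = patches.reshape(grid, grid, patch, patch).transpose(0, 2, 1, 3)
    return img.reshape(grid * patch, grid * patch), attn

def mse(pred, target):
    # Assumed reconstruction loss against a rendered view of the same pose.
    return float(np.mean((pred - target) ** 2))

# Usage with placeholder data: 64 points with 32-dimensional features.
rng = np.random.default_rng(1)
point_feats = rng.standard_normal((64, 32))
params = init_params(dim=32, patch=4)
image, attn = generate_view(point_feats, np.eye(3), params)
loss = mse(image, np.zeros_like(image))  # zeros stand in for a rendered target
```

Because the queries depend only on the pose and the keys and values depend only on the backbone output, a decoder of this shape places no constraint on the backbone architecture, which is consistent with the claim that the method is adaptable to any point cloud model.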