Pix4Point: Image Pretrained Standard Transformers for 3D Point Cloud Understanding

While Transformers have achieved impressive success in natural language processing and computer vision, their performance on 3D point clouds is relatively poor. This is mainly due to a limitation of Transformers: their demanding need for extensive training data. Unfortunately, large datasets are scarce in the realm of 3D point clouds, which makes training Transformers for 3D tasks even harder. In this work, we address the data issue of point cloud Transformers from two perspectives: (i) introducing more inductive bias to reduce the dependency of Transformers on data, and (ii) relying on cross-modality pretraining. More specifically, we first present Progressive Point Patch Embedding and a new point cloud Transformer model named PViT. PViT shares the same backbone as the standard Transformer but is shown to be less data-hungry, enabling Transformers to achieve performance comparable to the state of the art. Second, we formulate a simple yet effective pipeline dubbed "Pix4Point" that harnesses Transformers pretrained in the image domain to enhance downstream point cloud understanding. This is achieved through a modality-agnostic Transformer backbone, combined with tokenizers and decoders specialized for each domain. When pretrained on a large number of widely available images, PViT shows significant gains in 3D point cloud classification, part segmentation, and semantic segmentation on ScanObjectNN, ShapeNetPart, and S3DIS, respectively. Our code and models are available at https://github.com/guochengqian/Pix4Point.
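
The idea of a modality-agnostic backbone with domain-specific tokenizers and decoders can be illustrated with a small sketch. The code below is not the released Pix4Point implementation: every class name, the single attention layer, the chunk-and-max-pool point tokenizer (a crude placeholder, not Progressive Point Patch Embedding), and all sizes are assumptions chosen only to show how one backbone object could serve both images and point clouds. Weights are random; in the actual pipeline the backbone would be pretrained on images.

```python
import math
import random

DIM = 8  # token width shared by both modalities (illustrative value)


def linear(vec, weight):
    """Apply a weight matrix (list of output rows) to a vector."""
    return [sum(v * w for v, w in zip(vec, row)) for row in weight]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class SharedBackbone:
    """Hypothetical stand-in for the backbone: one self-attention layer."""

    def __init__(self, dim, rng):
        self.wq = rand_matrix(dim, dim, rng)
        self.wk = rand_matrix(dim, dim, rng)
        self.wv = rand_matrix(dim, dim, rng)

    def __call__(self, tokens):
        q = [linear(t, self.wq) for t in tokens]
        k = [linear(t, self.wk) for t in tokens]
        v = [linear(t, self.wv) for t in tokens]
        scale = math.sqrt(len(tokens[0]))
        out = []
        for qi, ti in zip(q, tokens):
            scores = [sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]  # stable softmax
            attn = [e / sum(exps) for e in exps]
            mixed = [sum(a * vj[d] for a, vj in zip(attn, v))
                     for d in range(len(ti))]
            out.append([x + y for x, y in zip(ti, mixed)])  # residual
        return out


class ImageTokenizer:
    """Projects non-overlapping patch x patch blocks of a grayscale image.
    Assumes image height and width are multiples of `patch`."""

    def __init__(self, patch, dim, rng):
        self.patch = patch
        self.proj = rand_matrix(dim, patch * patch, rng)

    def __call__(self, image):
        p = self.patch
        tokens = []
        for r in range(0, len(image), p):
            for c in range(0, len(image[0]), p):
                flat = [image[r + i][c + j]
                        for i in range(p) for j in range(p)]
                tokens.append(linear(flat, self.proj))
        return tokens


class PointTokenizer:
    """Placeholder tokenizer: chunks consecutive points and max-pools them."""

    def __init__(self, group, dim, rng):
        self.group = group
        self.proj = rand_matrix(dim, 3, rng)

    def __call__(self, points):
        tokens = []
        for s in range(0, len(points), self.group):
            feats = [linear(pt, self.proj) for pt in points[s:s + self.group]]
            tokens.append([max(col) for col in zip(*feats)])
        return tokens


class ClsHead:
    """Task-specific decoder: mean-pool the tokens, then a linear layer."""

    def __init__(self, dim, classes, rng):
        self.w = rand_matrix(classes, dim, rng)

    def __call__(self, tokens):
        pooled = [sum(col) / len(tokens) for col in zip(*tokens)]
        return linear(pooled, self.w)


rng = random.Random(0)
backbone = SharedBackbone(DIM, rng)  # the only component shared across domains

image = [[rng.random() for _ in range(8)] for _ in range(8)]
points = [[rng.random() for _ in range(3)] for _ in range(32)]

image_tokens = ImageTokenizer(4, DIM, rng)(image)
point_tokens = PointTokenizer(8, DIM, rng)(points)

# Same backbone, different tokenizer and head per modality.
# The class counts (10 and 15) are illustrative.
image_logits = ClsHead(DIM, 10, rng)(backbone(image_tokens))
point_logits = ClsHead(DIM, 15, rng)(backbone(point_tokens))
```

Because both tokenizers emit sequences of `DIM`-wide tokens, the backbone never needs to know which modality it is processing. This is the property that would let image-pretrained weights be reused for point clouds in a design of this kind.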
