Current popular backbones in computer vision, such as Vision Transformers (ViT) and ResNets, are trained to perceive the world from 2D images. To capture 3D structural priors more effectively in 2D backbones, we propose Mask3D, which leverages existing large-scale RGB-D data in a self-supervised pre-training scheme to embed these 3D priors into learned 2D feature representations. In contrast to traditional 3D contrastive learning paradigms, which require 3D reconstructions or multi-view correspondences, our approach is simple: we formulate a pretext reconstruction task by masking RGB and depth patches in individual RGB-D frames. We demonstrate that Mask3D is particularly effective in embedding 3D priors into the powerful 2D ViT backbone, enabling improved representation learning for various scene understanding tasks, such as semantic segmentation, instance segmentation, and object detection. Experiments show that Mask3D notably outperforms existing self-supervised 3D pre-training approaches on ScanNet, NYUv2, and Cityscapes image understanding tasks, with an improvement of +6.5% mIoU over the state-of-the-art Pri3D on ScanNet image semantic segmentation.
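To make the masking step of the pretext task concrete, the following is a minimal NumPy sketch of patch masking on a single RGB-D frame. It is an illustration under stated assumptions, not the Mask3D implementation: the function names (`random_patch_mask`, `mask_rgbd`, `masked_l1`), the patch size, the mask ratios, the zero fill value, and the L1 loss restricted to masked depth pixels are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def random_patch_mask(height, width, patch, mask_ratio, rng):
    """Return a boolean (height, width) mask; True marks masked pixels."""
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by patch")
    grid_h, grid_w = height // patch, width // patch
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    # Pick the patches to hide, sampling without replacement.
    masked_ids = rng.choice(num_patches, size=num_masked, replace=False)
    grid = np.zeros(num_patches, dtype=bool)
    grid[masked_ids] = True
    grid = grid.reshape(grid_h, grid_w)
    # Expand each patch-level decision to a patch x patch block of pixels.
    return np.repeat(np.repeat(grid, patch, axis=0), patch, axis=1)


def mask_rgbd(rgb, depth, patch=16, rgb_ratio=0.5, depth_ratio=0.5, seed=0):
    """Mask RGB and depth patches independently (ratios are assumptions)."""
    rng = np.random.default_rng(seed)
    height, width = depth.shape
    rgb_mask = random_patch_mask(height, width, patch, rgb_ratio, rng)
    depth_mask = random_patch_mask(height, width, patch, depth_ratio, rng)
    # Masked pixels are filled with zeros; other fill strategies are possible.
    rgb_in = np.where(rgb_mask[..., None], 0.0, rgb)
    depth_in = np.where(depth_mask, 0.0, depth)
    return rgb_in, depth_in, rgb_mask, depth_mask


def masked_l1(pred, target, mask):
    """Mean absolute error over masked pixels only (assumed loss)."""
    return float(np.abs(pred - target)[mask].mean())


# Example: a synthetic 64 x 64 frame split into a 4 x 4 grid of 16-pixel patches.
rgb = np.ones((64, 64, 3))
depth = np.ones((64, 64))
rgb_in, depth_in, rgb_mask, depth_mask = mask_rgbd(
    rgb, depth, patch=16, rgb_ratio=0.5, depth_ratio=0.25
)
```

In a pre-training loop built on this sketch, `rgb_in` and `depth_in` would be fed to the 2D backbone, and a reconstruction loss such as `masked_l1` would compare the network's prediction against the original values at the masked locations.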