Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding

Pretrained backbones with fine-tuning have been widely adopted in 2D vision and natural language processing tasks and have demonstrated significant advantages over task-specific networks. In this paper, we present a pretrained 3D backbone, named Swin3D, which for the first time outperforms all state-of-the-art methods on downstream 3D indoor scene understanding tasks. Our backbone network is based on a 3D Swin transformer and is carefully designed to efficiently conduct self-attention on sparse voxels with linear memory complexity and to capture the irregularity of point signals via a generalized contextual relative positional embedding. Based on this backbone design, we pretrained a large Swin3D model on the synthetic Structured3D dataset, which is 10 times larger than the ScanNet dataset, and fine-tuned the pretrained model on various downstream real-world indoor scene understanding tasks. The results demonstrate that our model, pretrained on the synthetic dataset, not only generalizes well to both downstream segmentation and detection on real 3D point datasets, but also surpasses the state-of-the-art methods on downstream tasks after fine-tuning, with +2.3 mIoU and +2.2 mIoU on S3DIS Area5 and 6-fold semantic segmentation, +2.1 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. Our method demonstrates the great potential of pretrained 3D backbones with fine-tuning for 3D understanding tasks. The code and models are available at https://github.com/microsoft/Swin3D.
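To make the idea of window-based self-attention on sparse voxels more concrete, the following is a minimal NumPy sketch. It is not the Swin3D implementation: the function name `window_attention_sparse`, the `window` parameter, and the single-head, projection-free attention are all assumptions made for illustration, and shifted windows and the contextual relative positional embedding are omitted. It only shows the general pattern of grouping occupied voxels by window and attending within each group. For a fixed window size, each group holds at most `window**3` voxels, so the total number of attention scores in this sketch grows linearly with the number of occupied voxels; the memory-efficient scheme in the paper may achieve its linear complexity differently.

```python
import numpy as np

def window_attention_sparse(coords, feats, window=4):
    """Self-attention restricted to non-overlapping 3D windows (illustrative).

    coords: (N, 3) integer coordinates of occupied voxels only.
    feats:  (N, C) per-voxel features.
    Returns an (N, C) array. Hypothetical helper: single head, no learned
    projections, no shifted windows, no positional embedding.
    """
    coords = np.asarray(coords)
    feats = np.asarray(feats, dtype=np.float64)

    # Map each voxel to the window that contains it, then to a window id.
    win = coords // window
    _, win_id = np.unique(win, axis=0, return_inverse=True)
    win_id = win_id.reshape(-1)

    # Group voxel indices by window id without building a dense grid.
    order = np.argsort(win_id, kind="stable")
    breaks = np.flatnonzero(np.diff(win_id[order])) + 1
    groups = np.split(order, breaks)

    out = np.empty_like(feats)
    scale = 1.0 / np.sqrt(feats.shape[1])
    for idx in groups:
        x = feats[idx]                               # (k, C), k <= window**3
        logits = (x @ x.T) * scale                   # scores inside one window
        logits -= logits.max(axis=1, keepdims=True)  # numerically stable softmax
        weights = np.exp(logits)
        weights /= weights.sum(axis=1, keepdims=True)
        out[idx] = weights @ x
    return out
```

In this sketch, a voxel that is alone in its window attends only to itself and keeps its feature unchanged, and changing features in one window never affects the output in another.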
