Foundation models have made significant strides in 2D and language tasks such as image segmentation, object detection, and visual-language understanding. However, their potential to improve 3D scene representation learning remains largely untapped because of the domain gap. In this paper, we propose Bridge3D, a method that addresses this gap by pre-training 3D models with features, semantic masks, and captions sourced from foundation models. Specifically, our approach uses semantic masks from these models to guide the masking and reconstruction process in a masked autoencoder. This strategy lets the network concentrate on foreground objects, thereby improving 3D representation learning. Additionally, we bridge the 3D-text gap at the scene level by using image captioning foundation models. To further facilitate knowledge distillation from well-learned 2D and text representations to the 3D model, we introduce a novel method that employs foundation models to generate accurate object-level masks and object-level semantic text information. Our approach outperforms state-of-the-art methods on 3D object detection and semantic segmentation. For instance, on the ScanNet dataset, our method surpasses the previous state-of-the-art method, PiMAE, by 5.3%.
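The idea of letting semantic masks guide masking and reconstruction in a masked autoencoder can be illustrated with a small sketch. This is not the Bridge3D implementation: the function names, the `fg_weight` and `fg_loss_weight` parameters, and the specific choice of weighted sampling plus a weighted mean-squared error are all assumptions made for illustration. The sketch assumes each 3D token already carries a foreground/background flag derived from a foundation-model semantic mask.

```python
import random


def foreground_guided_mask(is_foreground, mask_ratio=0.6, fg_weight=3.0, seed=None):
    """Pick tokens to mask, favouring foreground tokens (hypothetical scheme).

    is_foreground: list of bools, one per token (assumed to come from a
    foundation-model semantic mask projected onto the point cloud).
    Returns a list of bools, True where the token is masked.
    """
    rng = random.Random(seed)
    n_tokens = len(is_foreground)
    n_masked = round(mask_ratio * n_tokens)
    # Weighted sampling without replacement (Efraimidis-Spirakis):
    # draw u ~ U(0, 1) per token, use key u ** (1 / weight), keep the largest keys.
    keys = []
    for i, fg in enumerate(is_foreground):
        weight = fg_weight if fg else 1.0
        keys.append((rng.random() ** (1.0 / weight), i))
    masked = {i for _, i in sorted(keys, reverse=True)[:n_masked]}
    return [i in masked for i in range(n_tokens)]


def masked_reconstruction_loss(pred, target, mask, is_foreground, fg_loss_weight=2.0):
    """Weighted MSE over masked tokens only; foreground tokens count more.

    pred, target: lists of per-token feature vectors (lists of floats).
    The weighting is an assumption, not a loss taken from the paper.
    """
    total, norm = 0.0, 0.0
    for p, t, m, fg in zip(pred, target, mask, is_foreground):
        if not m:
            continue  # visible tokens are not reconstructed
        w = fg_loss_weight if fg else 1.0
        err = sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)
        total += w * err
        norm += w
    return total / norm if norm else 0.0


if __name__ == "__main__":
    flags = [True] * 4 + [False] * 6
    mask = foreground_guided_mask(flags, mask_ratio=0.6, seed=0)
    print(sum(mask), "of", len(mask), "tokens masked")
```

With `fg_weight` above 1, foreground tokens are more likely to be masked, so the reconstruction objective is spent mostly on objects rather than on walls and floors. Setting both weights to 1.0 recovers uniform random masking with a plain masked MSE.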