3D shapes represented as point clouds have seen advances in multimodal pre-training that aligns them with images and language descriptions, which is crucial for object identification, classification, and retrieval. However, the discrete representation of a point cloud loses the object's surface shape information and creates a gap between rendering results and 2D correspondences. To address this problem, we propose GS-CLIP, the first attempt to introduce 3DGS (3D Gaussian Splatting) into multimodal pre-training to enhance 3D representation. GS-CLIP leverages a pre-trained vision-language model, whose common visual and textual space was learned from massive real-world image-text pairs, and then learns a 3D encoder that aligns the 3DGS optimized per object to that space. Additionally, a novel Gaussian-Aware Fusion is proposed to extract and fuse global explicit features. As a general framework for language-image-3D pre-training, GS-CLIP is agnostic to 3D backbone networks. Experiments in challenging settings show that GS-CLIP significantly outperforms the previous state-of-the-art results.
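The alignment step described in the abstract, in which a 3D encoder is trained to match a frozen vision-language embedding space, can be illustrated with a CLIP-style contrastive objective. The abstract does not state the loss that GS-CLIP uses, so the following is only a minimal sketch under the assumption of a symmetric InfoNCE loss. The function names (`l2_normalize`, `info_nce`, `alignment_loss`), the temperature value, and the random arrays standing in for encoder outputs are all hypothetical and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two (N, D) batches.

    Row i of `a` and row i of `b` are treated as a positive pair;
    every other row in the batch acts as a negative.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature  # (N, N) scaled cosine similarities

    def diagonal_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the a-to-b and b-to-a directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


def alignment_loss(z_3d, z_img, z_txt, temperature=0.07):
    """Assumed objective: pull 3D embeddings toward both frozen image and
    frozen text embeddings of the same object (hypothetical formulation)."""
    return info_nce(z_3d, z_img, temperature) + info_nce(z_3d, z_txt, temperature)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_img = rng.normal(size=(8, 64))  # stand-in for frozen image embeddings
    z_txt = rng.normal(size=(8, 64))  # stand-in for frozen text embeddings
    z_3d = z_img + z_txt              # stand-in for 3D encoder outputs
    print("aligned loss:   ", alignment_loss(z_3d, z_img, z_txt))
    print("misaligned loss:", alignment_loss(np.roll(z_3d, 1, axis=0), z_img, z_txt))
```

In this sketch only the 3D embeddings would receive gradients in a real training loop, since the image and text embeddings come from the frozen pre-trained model. Because the loss operates purely on embeddings, it is independent of the 3D backbone that produces `z_3d`, which is in line with the backbone-agnostic claim.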