CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting

Recent work in 3D multimodal learning has made remarkable progress. However, most 3D multimodal models can only handle point clouds. Compared with 3D Gaussian Splatting (3DGS), an emerging 3D representation technique, spatially sparse point clouds cannot depict the texture of 3D objects, which results in inferior reconstruction capability. This limitation constrains the potential of point cloud-based 3D multimodal representation learning. In this paper, we present CLIP-GS, a novel multimodal representation learning framework grounded in 3DGS. We introduce the GS Tokenizer to generate serialized Gaussian tokens, which are then processed by transformer layers initialized with weights from point cloud models to produce the 3DGS embeddings. CLIP-GS applies a contrastive loss between the 3DGS embeddings and the visual and text embeddings of CLIP, and we introduce an image voting loss to guide the direction and convergence of gradient optimization. Furthermore, we develop an efficient way to generate triplets of 3DGS, images, and text, which helps CLIP-GS learn unified multimodal representations. Leveraging these well-aligned multimodal representations, CLIP-GS demonstrates versatility and outperforms point cloud-based models on various 3D tasks, including multimodal retrieval, zero-shot classification, and few-shot classification.
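
To make the alignment objective concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss between a batch of 3DGS embeddings and the matching image and text embeddings. The abstract does not give the exact loss formulation, so everything here is an assumption for illustration: the function names `symmetric_contrastive_loss` and `alignment_loss`, the temperature of 0.07, and the equal weighting of the image and text terms. The GS Tokenizer and the image voting loss are not sketched, since their details are not described above.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE loss; row i of `a` is the positive pair of row i of `b`.

    The temperature value is an illustrative assumption, not taken from the paper.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(a.shape[0])               # positives lie on the diagonal
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)


def alignment_loss(gs_emb, image_emb, text_emb, temperature=0.07):
    """Align 3DGS embeddings with both image and text embeddings.

    Equal weighting of the two terms is an assumption made for this sketch.
    """
    return (symmetric_contrastive_loss(gs_emb, image_emb, temperature)
            + symmetric_contrastive_loss(gs_emb, text_emb, temperature))


# Usage with random stand-ins for encoder outputs (hypothetical shapes).
rng = np.random.default_rng(0)
gs_emb = rng.normal(size=(8, 16))                         # 3DGS encoder output
image_emb = gs_emb + 0.01 * rng.normal(size=(8, 16))      # frozen CLIP image features
text_emb = gs_emb + 0.01 * rng.normal(size=(8, 16))       # frozen CLIP text features
print(alignment_loss(gs_emb, image_emb, text_emb))
```

In a setup like this, the image and text embeddings would typically come from a frozen CLIP model, so only the 3DGS encoder receives gradients. The loss is small when each 3DGS embedding is closest to its own image and text embeddings and grows when the pairs are mismatched.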
