While vision-language models have profoundly linked features between text and images, the incorporation of 3D modalities, such as point clouds and 3D Gaussians, further enables pretraining for 3D-related tasks, e.g., cross-modal retrieval, zero-shot classification, and scene recognition. As challenges remain in extracting 3D modal features and bridging the gap between modalities, we propose TIGaussian, a framework that harnesses the characteristics of 3D Gaussian Splatting (3DGS) to strengthen cross-modal alignment through a multi-branch 3DGS tokenizer and modality-specific 3D feature alignment strategies. Specifically, our multi-branch 3DGS tokenizer decouples the intrinsic properties of 3DGS structures into compact latent representations, enabling more generalizable feature extraction. To further bridge the modality gap, we develop bidirectional cross-modal alignment strategies: a multi-view feature fusion mechanism that leverages diffusion priors to resolve perspective ambiguity in image-3D alignment, and a text-3D projection module that adaptively maps 3D features to the text embedding space for better text-3D alignment. Extensive experiments on various datasets demonstrate the state-of-the-art performance of TIGaussian across multiple tasks.
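The multi-branch tokenizer idea can be illustrated with a minimal sketch: each intrinsic 3DGS attribute (position, rotation, scale, opacity, spherical-harmonic color) is embedded by its own branch and the branch outputs are fused into one compact token per Gaussian. The attribute dimensions, branch widths, and linear projections below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

# Hypothetical per-attribute dimensionalities of a 3D Gaussian primitive
# (48 = degree-3 spherical-harmonic color coefficients, 3 channels x 16).
ATTRS = {"position": 3, "rotation": 4, "scale": 3, "opacity": 1, "sh_color": 48}
LATENT = 16  # per-branch latent width (illustrative choice)

rng = np.random.default_rng(0)

# One random linear projection per branch stands in for a learned encoder.
branches = {name: rng.standard_normal((dim, LATENT)) for name, dim in ATTRS.items()}

def tokenize(gaussians: dict) -> np.ndarray:
    """Map N Gaussians to N fused tokens by concatenating branch embeddings."""
    parts = [gaussians[name] @ W for name, W in branches.items()]
    return np.concatenate(parts, axis=-1)  # shape: (N, 5 * LATENT)

# Example: 10 Gaussians with random attribute values.
g = {name: rng.standard_normal((10, dim)) for name, dim in ATTRS.items()}
tokens = tokenize(g)
print(tokens.shape)  # (10, 80)
```

Decoupling the attributes into separate branches, rather than flattening each Gaussian into one vector, lets each branch specialize to the statistics of its attribute before fusion.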