TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding

The limited scale of current 3D shape datasets hinders progress in 3D shape understanding and motivates multi-modal learning approaches that transfer knowledge from the data-abundant 2D image and language modalities to 3D shapes. However, even though image and language representations have been aligned by cross-modal models like CLIP, we find that the image modality contributes less than the language modality in existing multi-modal 3D representation learning methods. We attribute this to the domain shift in the 2D images and the distinct focus of each modality. To leverage both modalities more effectively in pre-training, we introduce TriAdapter Multi-Modal Learning (TAMM), a novel two-stage learning approach based on three synergistic adapters. First, our CLIP Image Adapter mitigates the domain gap between 3D-rendered images and natural images by adapting the visual representations of CLIP for synthetic image-text pairs. Subsequently, our Dual Adapters decouple the 3D shape representation space into two complementary sub-spaces, one focusing on visual attributes and the other on semantic understanding, which ensures a more comprehensive and effective multi-modal pre-training. Extensive experiments demonstrate that TAMM consistently enhances 3D representations across a wide range of 3D encoder architectures, pre-training datasets, and downstream tasks. Notably, we boost the zero-shot classification accuracy on Objaverse-LVIS from 46.8 to 50.7, and improve the 5-way 10-shot linear probing classification accuracy on ModelNet40 from 96.1 to 99.0. Project page: \url{https://alanzhangcs.github.io/tamm-page}.
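The idea of decoupling a 3D shape feature into a visual sub-space and a semantic sub-space can be illustrated with a small sketch. The abstract does not specify the adapter architecture or the training loss, so everything below is an assumption made for illustration: the residual linear form of `residual_adapter`, the blending weight `alpha`, the one-directional InfoNCE-style loss `info_nce`, the temperature value, and all function and parameter names are hypothetical and are not taken from the TAMM implementation.

```python
import math


def linear(x, w):
    """Apply a weight matrix w (out_dim x in_dim, list of rows) to vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def normalize(v):
    """Scale v to unit L2 norm (left unchanged if it is the zero vector)."""
    n = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / n for a in v]


def residual_adapter(x, w, alpha=0.5):
    """Assumed adapter form: blend a linear projection with the input feature."""
    adapted = linear(x, w)
    return normalize([alpha * a + (1 - alpha) * b for a, b in zip(adapted, x)])


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys."""
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(queries)


def dual_adapter_loss(shape_feats, image_embs, text_embs, w_img, w_txt):
    """Hypothetical objective: one adapter per sub-space, each aligned to one modality."""
    visual = [residual_adapter(f, w_img) for f in shape_feats]    # visual sub-space
    semantic = [residual_adapter(f, w_txt) for f in shape_feats]  # semantic sub-space
    return info_nce(visual, image_embs) + info_nce(semantic, text_embs)
```

In this sketch the same 3D feature passes through two separate adapters, so the image-aligned and text-aligned objectives no longer compete for a single embedding. In practice the image and text embeddings would come from a CLIP-style model and the adapter weights would be learned by gradient descent; both are omitted here.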
