Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation

In recent years, 3D understanding has turned to 2D vision-language pre-trained models to overcome data scarcity. However, existing methods simply transfer 2D alignment strategies, aligning 3D representations with single-view 2D images and coarse-grained parent-category text. These approaches suffer from two problems, information degradation and insufficient synergy, both of which cost performance. Information degradation arises from overlooking the fact that a 3D representation should be equivalent to a series of multi-view images and more fine-grained subcategory text. Insufficient synergy arises from neglecting that a robust 3D representation should align with the joint vision-language space rather than with each modality independently. In this paper, we propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text, and image. Specifically, a novel Structured Multimodal Organizer (SMO) addresses the information degradation issue by introducing contiguous multi-view images and hierarchical text to enrich the representation of the vision and language modalities. A Joint Multi-modal Alignment (JMA) tackles the insufficient synergy problem by modeling the joint modality, incorporating language knowledge into the visual modality. Extensive experiments on ModelNet40 and ScanObjectNN demonstrate the effectiveness of JM3D, which achieves state-of-the-art performance in zero-shot 3D classification. On ModelNet40, JM3D outperforms ULIP in top-1 zero-shot 3D classification accuracy by approximately 4.3% with PointMLP and by up to 6.5% with PointNet++. The source code and trained models for all our experiments are publicly available at https://github.com/Mr-Neko/JM3D.
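To make the idea of aligning a point cloud with a joint vision-language target (rather than with each modality separately) more concrete, the following is a minimal sketch under stated assumptions. It is not the JM3D implementation: the fusion rule in `joint_embedding` (weighting each view by its similarity to the text embedding), the helper names, the toy 3-dimensional embeddings, and the temperature value are all hypothetical choices made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def joint_embedding(view_embs, text_emb):
    """Hypothetical fusion (an assumption, not the paper's JMA formula):
    weight each view embedding by its similarity to the text embedding,
    sum the weighted views, and renormalize."""
    views = [normalize(v) for v in view_embs]
    text = normalize(text_emb)
    weights = softmax([dot(v, text) for v in views])
    fused = [sum(w * v[i] for w, v in zip(weights, views))
             for i in range(len(text))]
    return normalize(fused)


def info_nce(point_embs, joint_embs, temperature=0.07):
    """Contrastive loss: each point-cloud embedding should match the joint
    embedding at the same index. The temperature is an assumed value."""
    points = [normalize(p) for p in point_embs]
    loss = 0.0
    for i, p in enumerate(points):
        logits = [dot(p, j) / temperature for j in joint_embs]
        loss += -math.log(softmax(logits)[i])
    return loss / len(points)


# Toy data: two objects, each with two rendered views and one text embedding.
views_a, text_a = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]], [1.0, 0.0, 0.0]
views_b, text_b = [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]], [0.0, 0.0, 1.0]
joints = [joint_embedding(views_a, text_a), joint_embedding(views_b, text_b)]

loss_matched = info_nce([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], joints)
loss_swapped = info_nce([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]], joints)
```

In this toy setup, `loss_matched` is lower than `loss_swapped`, since each point-cloud embedding lies close to the joint embedding of its own object. A real pipeline would obtain the view and text embeddings from a pre-trained vision-language model and the point-cloud embeddings from a 3D backbone such as PointMLP or PointNet++.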
