Contrastive learning has emerged as a promising paradigm for 3D open-world understanding that jointly uses text, images, and point clouds. In this paper, we introduce MixCon3D, which combines complementary information from 2D images and 3D point clouds to enhance contrastive learning. By further integrating multi-view 2D images, MixCon3D strengthens the traditional tri-modal representation, offering a more accurate and comprehensive depiction of real-world 3D objects and improving text alignment. Additionally, we present the first thorough investigation of various training recipes for the 3D contrastive learning paradigm, building a solid baseline with improved performance. Extensive experiments on three representative benchmarks show that our method significantly improves over the baseline, surpassing the previous state-of-the-art performance on the challenging 1,156-category Objaverse-LVIS dataset by 5.7%. We further demonstrate the effectiveness of our approach in additional applications, including text-to-3D retrieval and point cloud captioning. The code is available at https://github.com/UCSC-VLAA/MixCon3D.