In this study, we explore an alternative approach to enhancing contrastive text-image-3D alignment in the absence of textual descriptions for 3D objects. We introduce two unsupervised methods, $I2I$ and $(I2L)^2$, which leverage CLIP knowledge of textual and 2D data to compute the neural perceived similarity between two 3D samples. We employ the proposed methods to mine 3D hard negatives, establishing a multimodal contrastive pipeline with hard-negative weighting via a custom loss function. We train on different configurations of the proposed hard-negative mining approach and evaluate the accuracy of our models on 3D classification and on a cross-modal retrieval benchmark, testing both image-to-shape and shape-to-image retrieval. Results demonstrate that, even without explicit text alignment, our approach achieves comparable or superior performance on zero-shot and standard 3D classification, while significantly improving both image-to-shape and shape-to-image retrieval compared to previous methods.
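The exact form of the custom loss is not given in this section; below is a minimal sketch of what a hard-negative-weighted contrastive (InfoNCE-style) objective between two modalities could look like, assuming PyTorch. The function name `weighted_info_nce` and the `neg_weights` matrix (e.g. weights derived from $I2I$ or $(I2L)^2$ similarities of mined 3D hard negatives) are hypothetical illustrations, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def weighted_info_nce(feats_a, feats_b, neg_weights, temperature=0.07):
    """Symmetric InfoNCE-style loss with per-negative weighting.

    feats_a, feats_b: (N, D) embeddings from two modalities
                      (e.g. 3D shapes and rendered images).
    neg_weights:      (N, N) weights for negative pairs; entries > 1
                      emphasize mined hard negatives. Diagonal is ignored.
    """
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    logits = a @ b.t() / temperature                       # (N, N) similarities
    eye = torch.eye(len(a), device=a.device, dtype=torch.bool)

    exp = logits.exp()
    # Positives keep weight 1; hard negatives enter the softmax
    # denominator with their (typically larger) mined weights.
    w = torch.where(eye, torch.ones_like(neg_weights), neg_weights)

    # Shape-to-image direction.
    loss_ab = -(exp[eye] / (w * exp).sum(dim=1)).log().mean()
    # Image-to-shape direction: transpose the similarity matrix.
    loss_ba = -(exp.t()[eye] / (w.t() * exp.t()).sum(dim=1)).log().mean()
    return 0.5 * (loss_ab + loss_ba)
```

With `neg_weights` set to all ones this reduces to the standard symmetric contrastive objective used in CLIP-style training, so the weighting can be read as a drop-in modification of the softmax denominator.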