ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding

The recognition capabilities of current state-of-the-art 3D models are limited by datasets that contain few annotated samples and a pre-defined set of categories. In the 2D domain, recent advances have shown that similar problems can be significantly alleviated by drawing on knowledge from other modalities, such as language. Inspired by this, leveraging multimodal information for the 3D modality is a promising way to improve 3D understanding in the restricted-data regime, but this line of research has not been well studied. We therefore introduce ULIP, which learns a unified representation of images, text, and 3D point clouds by pre-training with object triplets from the three modalities. To overcome the shortage of training triplets, ULIP leverages a pre-trained vision-language model that has already learned a common visual and textual space by training on massive image-text pairs. ULIP then learns a 3D representation space aligned with this common image-text space, using a small number of automatically synthesized triplets. ULIP is agnostic to the 3D backbone network and can easily be integrated into any 3D architecture. Experiments show that ULIP effectively improves the performance of multiple recent 3D backbones simply by pre-training them on ShapeNet55 with our framework, achieving state-of-the-art performance in both standard 3D classification and zero-shot 3D classification on ModelNet40 and ScanObjectNN. ULIP also improves the performance of PointMLP by around 3% in 3D classification on ScanObjectNN, and outperforms PointCLIP by 28.8% in top-1 accuracy for zero-shot 3D classification on ModelNet40. Our code and pre-trained models are released at https://github.com/salesforce/ULIP.
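
The alignment idea described above can be illustrated with a small sketch. The abstract does not state the training objective, so the symmetric contrastive loss used here is an assumption, as are the function names (alignment_loss, zero_shot_classify), the temperature value, and the use of plain Python lists in place of real encoder outputs. The image and text embeddings stand in for outputs of the pre-trained vision-language model; only the point-cloud embeddings would be trained. This is one plausible reading, not the released implementation.

```python
import math

def normalize(v):
    # Scale a vector to unit length (left unchanged if its norm is zero).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def similarity_matrix(a, b):
    # Cosine similarity between every vector in a and every vector in b.
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, v)) for v in b] for u in a]

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct column for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def contrastive_loss(a, b, temperature=0.07):
    # Symmetric contrastive loss: matching pairs share the same index.
    # The temperature is an assumed hyperparameter, not taken from the paper.
    logits = [[s / temperature for s in row] for row in similarity_matrix(a, b)]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

def alignment_loss(pc, img, txt, temperature=0.07):
    # Pull each point-cloud embedding toward the image and text embeddings
    # of the same object triplet (img and txt are treated as fixed inputs).
    return contrastive_loss(pc, img, temperature) + contrastive_loss(pc, txt, temperature)

def zero_shot_classify(pc, class_txt):
    # Assign each point cloud to the category whose text embedding is closest.
    sims = similarity_matrix(pc, class_txt)
    return [max(range(len(row)), key=row.__getitem__) for row in sims]

# Toy triplets: three objects, three-dimensional embeddings (made-up values).
img = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
txt = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
aligned_pc = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
shuffled_pc = [aligned_pc[1], aligned_pc[2], aligned_pc[0]]
```

In this toy setup, alignment_loss(aligned_pc, img, txt) is smaller than alignment_loss(shuffled_pc, img, txt), which is the signal a 3D backbone would be trained to minimize. Once the spaces are aligned, zero_shot_classify needs only text embeddings of the category names, which is why no labeled 3D examples of the target classes are required.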
