Recent open-world 3D representation learning methods that use Vision-Language Models (VLMs) to align 3D data with image-text information have shown superior 3D zero-shot performance. However, the CAD-rendered images used for this alignment often lack realism and texture variation, which compromises alignment robustness. Moreover, the gap in scale between 3D and 2D pretraining datasets highlights the need for effective strategies to transfer the representational abilities of VLMs to 3D learning. In this paper, we present OpenDlign, a novel open-world 3D model that uses depth-aligned images generated by a diffusion model for robust multimodal alignment. These images exhibit greater texture diversity than CAD renderings because of the stochastic nature of the diffusion model. By refining the depth map projection pipeline and designing depth-specific prompts, OpenDlign leverages the rich knowledge in a pre-trained VLM for 3D representation learning with streamlined fine-tuning. Our experiments show that OpenDlign achieves high zero-shot and few-shot performance on diverse 3D tasks, despite fine-tuning only 6 million parameters on a limited ShapeNet dataset. In zero-shot classification, OpenDlign surpasses previous models by 8.0% on ModelNet40 and 16.4% on OmniObject3D. Additionally, using depth-aligned images for multimodal alignment consistently enhances the performance of other state-of-the-art models.
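The zero-shot classification setting described in the abstract can be illustrated with a minimal sketch. It is not the OpenDlign implementation: it assumes that a 3D object has already been projected to several depth maps, that an image encoder and a text encoder (not shown) have produced embeddings in a shared space, and that scores are averaged over views and prompts. The prompt templates and the names `DEPTH_PROMPT_TEMPLATES`, `build_prompts`, and `zero_shot_classify` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical depth-specific templates; the abstract does not list the actual prompts.
DEPTH_PROMPT_TEMPLATES = [
    "a depth map of a {}",
    "a grayscale depth image of a {}",
]


def build_prompts(class_names):
    """Return one list of filled-in prompts per class name."""
    return [[t.format(name) for t in DEPTH_PROMPT_TEMPLATES] for name in class_names]


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit length, guarding against zero norm."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_classify(view_embeddings, prompt_embeddings):
    """Classify one object from its depth-view embeddings.

    view_embeddings:   (V, D) array, one embedding per depth view.
    prompt_embeddings: (C, P, D) array, P prompt embeddings for each of C classes.
    Returns (predicted class index, per-class scores of shape (C,)).
    Averaging over prompts and views is an assumption of this sketch.
    """
    views = l2_normalize(np.asarray(view_embeddings, dtype=float))
    prompts = l2_normalize(np.asarray(prompt_embeddings, dtype=float))
    # Average the unit-length prompt embeddings of each class, then re-normalize.
    class_embeddings = l2_normalize(prompts.mean(axis=1))
    # Cosine similarity between every view and every class: shape (V, C).
    similarities = views @ class_embeddings.T
    # Average the similarities across views to get one score per class.
    scores = similarities.mean(axis=0)
    return int(np.argmax(scores)), scores
```

In practice the two embedding arrays would come from the VLM's image and text encoders; here they are left as inputs so the sketch stays independent of any particular model or library.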