Recent advances in Vision-Language Models (VLMs) have improved open-world 3D representation learning, enabling zero-shot 3D recognition of unseen categories. Existing open-world methods pre-train an extra 3D encoder to align features from 3D data (e.g., depth maps or point clouds) with CAD-rendered images and their corresponding texts. However, the limited color and texture variation in CAD images can compromise the robustness of this alignment. Furthermore, the discrepancy in size between the pre-training datasets of the 3D encoder and the VLM leads to sub-optimal 2D-to-3D knowledge transfer. To overcome these issues, we propose OpenDlign, a novel framework for learning open-world 3D representations that leverages depth-aligned images generated from point cloud-projected depth maps. Unlike CAD-rendered images, our generated images provide rich, realistic color and texture diversity while preserving geometric and semantic consistency with the depth maps. OpenDlign also optimizes depth map projection and integrates depth-specific text prompts, improving the adaptation of 2D VLM knowledge to 3D learning through efficient fine-tuning. Experimental results show that OpenDlign significantly outperforms existing methods on zero-shot and few-shot 3D tasks, exceeding prior scores by 8.0% on ModelNet40 and 16.4% on OmniObject3D with just 6 million tuned parameters. Moreover, integrating the generated depth-aligned images into existing 3D learning pipelines consistently improves their performance.
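To make the notion of a point cloud-projected depth map concrete, the following is a minimal sketch of a generic orthographic projection with a nearest-point (z-buffer) rule. It is not OpenDlign's optimized projection, which the abstract does not specify. The function name `project_depth_map`, the `size` parameter, and the convention that the camera looks along +z (so a smaller z is closer) are all assumptions made for illustration.

```python
import math


def project_depth_map(points, size=8):
    """Project (x, y, z) points onto a size x size grid in the XY plane.

    Assumption: the viewer looks along +z, so the smallest z that lands in a
    pixel is the visible surface. Pixels hit by no point stay at math.inf.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_min = min(xs), min(ys)
    # Use one shared scale for both axes so the object's aspect ratio is kept.
    span = max(max(xs) - x_min, max(ys) - y_min) or 1.0

    depth = [[math.inf] * size for _ in range(size)]
    for x, y, z in points:
        # Map normalized coordinates to the nearest pixel index.
        col = min(int((x - x_min) / span * (size - 1) + 0.5), size - 1)
        row = min(int((y - y_min) / span * (size - 1) + 0.5), size - 1)
        # Z-buffer rule: keep only the closest point per pixel.
        if z < depth[row][col]:
            depth[row][col] = z
    return depth


if __name__ == "__main__":
    # Hypothetical toy cloud: two points share a pixel, one sits in the far corner.
    cloud = [(0.0, 0.0, 1.0), (0.0, 0.0, 0.5), (1.0, 1.0, 2.0)]
    for row in project_depth_map(cloud, size=2):
        print(row)
```

In a pipeline of the kind the abstract describes, depth maps like this (rendered from several viewpoints) would condition an image generator to produce the depth-aligned images. That step is outside the scope of this sketch.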