Open-vocabulary 3D object detection (OV-3Det) aims to generalize beyond the limited number of base categories labeled during the training phase. The biggest bottleneck is the scarcity of annotated 3D data, whereas 2D image datasets are abundant and richly annotated. Consequently, it is intuitive to leverage the wealth of annotations in 2D images to alleviate the inherent data scarcity in OV-3Det. In this paper, we push the task setup to its limits by exploring the potential of using solely 2D images to learn OV-3Det. The major challenge for this setup is the modality gap between training images and testing point clouds, which prevents effective integration of 2D knowledge into OV-3Det. To address this challenge, we propose a novel framework, ImOV3D, that leverages a pseudo multimodal representation containing both images and point clouds (PC) to close the modality gap. The key to ImOV3D lies in flexible modality conversion: 2D images can be lifted into 3D using monocular depth estimation and can also be derived from 3D scenes through rendering. This allows unifying both training images and testing point clouds into a common image-PC representation, encompassing a wealth of 2D semantic information while also incorporating the depth and structural characteristics of 3D spatial data. We carefully conduct this conversion to minimize the domain gap between training and test cases. Extensive experiments on two benchmark datasets, SUNRGBD and ScanNet, show that ImOV3D significantly outperforms existing methods, even in the absence of ground truth 3D training data. With the inclusion of a minimal amount of real 3D data for fine-tuning, the performance also significantly surpasses the previous state-of-the-art. Code and pre-trained models are released at https://github.com/yangtiming/ImOV3D.
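The first half of the modality conversion described above, lifting a 2D image into a 3D point cloud from an estimated depth map, can be sketched as standard pinhole-camera unprojection. The sketch below is a minimal illustration, not the paper's actual pipeline: the function name and toy intrinsics (`fx`, `fy`, `cx`, `cy`) are hypothetical, and a real system would take the depth map from a monocular depth estimator rather than a hand-written array.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Unproject a depth map (H, W), in meters, into an (N, 3) point cloud
    using the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels with no valid depth

# Toy 2x2 depth map with one invalid (zero-depth) pixel.
depth = np.array([[1.0, 2.0],
                  [0.0, 4.0]])
pts = depth_to_point_cloud(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
# Three valid pixels survive, each becoming one 3D point.
```

The reverse direction, rendering a point cloud back into an image, is the same projection applied in the opposite order (divide by depth instead of multiplying), which is what makes the image-PC representation round-trippable.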