Open-vocabulary detection aims to identify novel objects from arbitrary textual descriptions. In this paper, we address open-vocabulary 3D point-cloud detection with a divide-and-conquer strategy that involves: 1) developing a point-cloud detector that learns a general representation for localizing various objects, and 2) connecting textual and point-cloud representations so that the detector can classify novel object categories from text prompts. Specifically, we draw on rich image pre-trained models: the point-cloud detector learns to localize objects under the supervision of 2D bounding boxes predicted by pre-trained 2D detectors. Moreover, we propose a novel de-biased triplet cross-modal contrastive learning scheme that connects the image, point-cloud, and text modalities, thereby enabling the point-cloud detector to benefit from vision-language pre-trained models, i.e., CLIP. This novel use of image and vision-language pre-trained models for point-cloud detectors allows for open-vocabulary 3D object detection without the need for 3D annotations. Experiments demonstrate that the proposed method improves by at least 3.03 points and 7.47 points over a wide range of baselines on the ScanNet and SUN RGB-D datasets, respectively. Furthermore, we provide a comprehensive analysis to explain why our approach works.
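To make the cross-modal contrastive idea concrete, the following is a minimal sketch of a generic three-modality contrastive objective, not the method proposed here. It assumes that row i of each feature matrix describes the same object in the point-cloud, image, and text modalities, and it sums a standard symmetric InfoNCE loss over the three modality pairs. The de-biasing step is not specified in this abstract and is deliberately left out. The function names, the temperature value, and the choice of which pairs to sum are all illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss; row i of `a` is the positive for row i of `b`."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature          # (N, N) pairwise similarities
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)


def triplet_cross_modal_loss(point_feats, image_feats, text_feats, temperature=0.07):
    """Hypothetical helper: sum of pairwise InfoNCE terms over three modalities.

    Assumption: all three pairs are weighted equally. No de-biasing is applied.
    """
    return (
        info_nce(point_feats, image_feats, temperature)
        + info_nce(point_feats, text_feats, temperature)
        + info_nce(image_feats, text_feats, temperature)
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 16))
    aligned = triplet_cross_modal_loss(feats, feats, feats)
    misaligned = triplet_cross_modal_loss(feats, np.roll(feats, 1, axis=0), feats)
    print(f"aligned: {aligned:.4f}  misaligned: {misaligned:.4f}")
```

In practice the image and text features would come from a vision-language model such as CLIP, and the point-cloud features from the detector being trained. The random arrays in the usage example only stand in for those embeddings.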