3D visual grounding involves finding the target object in a 3D scene that corresponds to a given sentence query. Although many approaches have been proposed and achieve impressive performance, they all require dense object-sentence pair annotations in 3D point clouds, which are time-consuming and expensive to collect. Because such fine-grained annotations are difficult to obtain, we propose to learn the 3D visual grounding model from weakly supervised annotations, i.e., only coarse scene-sentence correspondences are used to learn object-sentence links. To accomplish this, we design a novel semantic matching model that analyzes the semantic similarity between object proposals and sentences in a coarse-to-fine manner. Specifically, we first extract object proposals and coarsely select the top-K candidates based on feature and class similarity matrices. Next, we reconstruct the masked keywords of the sentence using each candidate in turn; the reconstruction accuracy gives a fine-grained measure of each candidate's semantic similarity to the query. Additionally, we distill the coarse-to-fine semantic matching knowledge into a typical two-stage 3D visual grounding model, which reduces inference costs and improves performance by taking full advantage of the well-studied structure of existing architectures. We conduct extensive experiments on ScanRefer, Nr3D, and Sr3D, which demonstrate the effectiveness of our proposed method.
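The coarse-to-fine selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`coarse_select`, `fine_select`), the use of cosine similarity, the unweighted sum of the feature and class scores, and the assumption that per-candidate reconstruction accuracies are already available from some masked-keyword reconstruction model are all hypothetical choices made for illustration.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between each row of a (M, d) and the vector b (d,)."""
    a_norm = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b_norm = b / (np.linalg.norm(b) + 1e-8)
    return a_norm @ b_norm


def coarse_select(proposal_feats, sentence_feat, class_probs, class_text_sim, k):
    """Coarse stage: return the indices of the top-k proposals and all scores.

    proposal_feats: (M, d) features of the M object proposals
    sentence_feat:  (d,)   embedding of the query sentence
    class_probs:    (M, C) predicted class distribution of each proposal
    class_text_sim: (C,)   similarity of each class name to the sentence
    """
    feat_sim = cosine_similarity(proposal_feats, sentence_feat)  # (M,)
    class_sim = class_probs @ class_text_sim                     # (M,)
    # Assumption: the two similarities are combined by a plain sum.
    score = feat_sim + class_sim
    k = min(k, score.shape[0])
    top_k = np.argsort(-score)[:k]  # indices sorted by descending score
    return top_k, score


def fine_select(candidate_ids, recon_accuracy):
    """Fine stage: pick the candidate with the highest reconstruction accuracy.

    recon_accuracy[i] is assumed to be the accuracy with which the masked
    keywords were reconstructed when conditioning on candidate_ids[i].
    """
    best = int(np.argmax(recon_accuracy))
    return int(candidate_ids[best])


# Toy example with 4 proposals, 2-dimensional features, and 2 classes.
proposal_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
sentence_feat = np.array([1.0, 0.0])
class_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.1, 0.9]])
class_text_sim = np.array([1.0, 0.0])

top_k, score = coarse_select(
    proposal_feats, sentence_feat, class_probs, class_text_sim, k=2
)
# Hypothetical reconstruction accuracies for the two surviving candidates.
target = fine_select(top_k, np.array([0.3, 0.8]))
print(top_k, target)
```

In this toy setting the coarse stage keeps proposals 0 and 2, and the fine stage picks proposal 2 because its (made-up) reconstruction accuracy is higher. In a weakly supervised pipeline, the selected proposal could then serve as a pseudo-label when distilling into a two-stage grounding model.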