We present a novel task for cross-dataset visual grounding in 3D scenes (Cross3DVG), designed to overcome limitations of existing 3D visual grounding models, specifically their restricted 3D resources and consequent tendency to overfit a specific 3D dataset. To facilitate Cross3DVG, we created RIORefer, a large-scale 3D visual grounding dataset. It includes more than 63k diverse, human-annotated descriptions of 3D objects within 1,380 indoor RGB-D scans from 3RScan. We train a model on a source 3D visual grounding dataset and then evaluate it, without using target labels, on a target dataset that differs in, e.g., sensors, 3D reconstruction methods, and language annotators. We conduct comprehensive experiments using established visual grounding models and a CLIP-based method that integrates multi-view 2D and 3D features, designed to bridge gaps among 3D datasets. The experiments show that (i) cross-dataset 3D visual grounding performs significantly worse than training and evaluating on a single dataset, because of the 3D data and language variations across datasets, and (ii) better object detection and localization modules, together with the fusion of 3D data and multi-view CLIP-based image features, can alleviate this performance drop. The Cross3DVG task can serve as a benchmark for developing robust 3D visual grounding models that handle diverse 3D scenes while leveraging deep language understanding.
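The cross-dataset protocol in the abstract (train on a source dataset, score on a target dataset whose labels are never used for training) needs a scoring function for predicted object locations. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the abstract does not name a metric, so the IoU-threshold criterion, the axis-aligned box format, and the names `box_iou_3d` and `accuracy_at_iou` are all illustrative choices.

```python
from typing import Sequence, Tuple

# Assumed box format: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned 3D box (0 if any side is degenerate)."""
    sides = [max(0.0, box[i + 3] - box[i]) for i in range(3)]
    return sides[0] * sides[1] * sides[2]


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(
    preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of descriptions whose predicted box reaches the IoU threshold.

    `preds[i]` is the box predicted for the i-th description and `gts[i]`
    is the annotated target box; target labels are used only for scoring.
    """
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not gts:
        return 0.0
    hits = sum(box_iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```

In a cross-dataset run, `preds` would come from a model trained only on the source dataset and `gts` from the target dataset's annotations; comparing this score against the same model's single-dataset score gives the performance gap discussed in finding (i).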