Cross3DVG: Baseline and Dataset for Cross-Dataset 3D Visual Grounding on Different RGB-D Scans

We present Cross3DVG, a novel task for cross-dataset visual grounding in 3D scenes, which reveals a limitation of existing 3D visual grounding models: they rely on restricted 3D resources and therefore easily overfit to a specific 3D dataset. To facilitate Cross3DVG, we have created a large-scale 3D visual grounding dataset containing more than 63k diverse, human-annotated descriptions of 3D objects within 1,380 indoor RGB-D scans from 3RScan, paired with the existing 52k descriptions in ScanRefer. We perform Cross3DVG by training a model on a source 3D visual grounding dataset and then evaluating it on a target dataset that was constructed differently (e.g., with different sensors, 3D reconstruction methods, and language annotators), without using target labels. We conduct comprehensive experiments using established visual grounding models, as well as a CLIP-based 2D-3D integration method designed to bridge the gaps between 3D datasets. By performing Cross3DVG tasks, we find that (i) cross-dataset 3D visual grounding performs significantly worse than training and evaluating on a single dataset, suggesting much room for improvement in the cross-dataset generalization of 3D visual grounding, (ii) better detectors and transformer-based localization modules are beneficial for enhancing 3D grounding performance, and (iii) fusing 2D-3D data using CLIP yields further performance improvements. Our Cross3DVG task will provide a benchmark for developing robust 3D visual grounding models capable of handling diverse 3D scenes while leveraging deep language understanding.
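The cross-dataset protocol described above (train on a source dataset, evaluate on a differently constructed target dataset without target labels at training time) can be illustrated with a minimal sketch. The abstract does not state the evaluation metric, so the sketch assumes an accuracy-at-IoU-threshold measure over axis-aligned 3D boxes, which is a common choice for 3D grounding benchmarks. All names here (`box_iou_3d`, `accuracy_at_iou`, `cross_dataset_eval`, the `predict` callable, and the sample dictionary keys) are hypothetical and are not taken from the Cross3DVG code.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

# Axis-aligned box as (xmin, ymin, zmin, xmax, ymax, zmax); an assumed format.
Box = Tuple[float, float, float, float, float, float]


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def accuracy_at_iou(preds: Sequence[Box], gts: Sequence[Box],
                    threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    if not gts:
        return 0.0
    hits = sum(box_iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


def cross_dataset_eval(predict: Callable[[Any, str], Box],
                       target_samples: List[Dict[str, Any]],
                       threshold: float = 0.5) -> float:
    """Evaluate a model trained on the source dataset on target samples.

    `predict` is assumed to have been fit on source data only; target
    boxes are used here solely for scoring, never for training.
    """
    preds = [predict(s["scene"], s["description"]) for s in target_samples]
    gts = [s["box"] for s in target_samples]
    return accuracy_at_iou(preds, gts, threshold)
```

In this sketch, a cross-dataset run would pass a predictor trained on one dataset (for example, ScanRefer) together with samples drawn from the other (for example, the 3RScan-based descriptions), and compare the resulting score with the single-dataset score obtained from the same function.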
