3D vision-language grounding, which focuses on aligning language with the 3D physical environment, is a cornerstone in the development of embodied agents. Compared with recent advances in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes, which stems from diverse object configurations, rich object attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three challenges by examining the potential of systematically upscaling 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising 2.5M vision-language pairs derived from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning. Through extensive experiments, we showcase the effectiveness of GPS by achieving state-of-the-art performance on all existing 3D visual grounding benchmarks. Zero-shot transfer experiments on challenging 3D vision-language tasks further reveal the potential of SceneVerse and GPS. Project website: https://scene-verse.github.io