Building 3D scene graphs has recently emerged as a way to represent the world in a structured and rich manner for several embodied AI applications. Given their increasing use in downstream tasks (e.g., navigation and room rearrangement), can we leverage and recycle them to create 3D maps of environments, a pivotal step in agent operation? We focus on the fundamental problem of aligning pairs of 3D scene graphs whose overlap can range from zero to partial and which can contain arbitrary changes. We propose SGAligner, the first method for aligning pairs of 3D scene graphs that is robust to in-the-wild scenarios (i.e., unknown overlap, if any, and changes in the environment). Inspired by multi-modality knowledge graphs, we use contrastive learning to learn a joint, multi-modal embedding space. We evaluate on the 3RScan dataset and further show that our method can be used to estimate the transformation between pairs of 3D scenes. Since benchmarks for these tasks are missing, we create them on this dataset. The code, benchmark, and trained models are available on the project website.
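To make the idea of alignment in a joint embedding space concrete, the following is a minimal sketch, not the SGAligner implementation. It assumes that each scene-graph node has already been mapped to a fixed-size embedding vector, and it pairs nodes across two graphs by mutual nearest neighbour under cosine similarity. The function name `match_nodes` and the threshold `min_sim` are hypothetical choices made for illustration; the threshold is one simple way to return no matches when two graphs do not overlap.

```python
import numpy as np


def match_nodes(emb_a, emb_b, min_sim=0.5):
    """Pair nodes of two graphs by mutual nearest neighbour in embedding space.

    emb_a: (n_a, d) array of node embeddings for graph A (assumed precomputed).
    emb_b: (n_b, d) array of node embeddings for graph B (assumed precomputed).
    min_sim: hypothetical cosine-similarity threshold; pairs below it are dropped.
    Returns a list of (index_in_a, index_in_b) tuples.
    """
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)

    # L2-normalise rows so that the dot product equals cosine similarity.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T  # shape (n_a, n_b)

    best_b_for_a = sim.argmax(axis=1)  # closest B node for each A node
    best_a_for_b = sim.argmax(axis=0)  # closest A node for each B node

    matches = []
    for i, j in enumerate(best_b_for_a):
        # Keep a pair only if each node is the other's nearest neighbour
        # and the similarity clears the threshold.
        if best_a_for_b[j] == i and sim[i, j] >= min_sim:
            matches.append((i, int(j)))
    return matches


if __name__ == "__main__":
    graph_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    graph_b = [[0.0, 0.9, 0.1], [1.0, 0.1, 0.0]]
    print(match_nodes(graph_a, graph_b))
```

In this toy input, the third node of `graph_a` has no counterpart in `graph_b` and is left unmatched, which mirrors the partial-overlap setting described in the abstract. The matched pairs could then feed a downstream step such as transformation estimation, though how SGAligner does this is not specified here.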