3D semantic scene graphs are a powerful holistic representation, as they describe the individual objects and depict the relations between them. They are compact high-level graphs that enable many tasks requiring scene reasoning. In real-world settings, existing 3D estimation methods produce robust predictions but mostly rely on dense inputs. In this work, we propose a real-time framework that incrementally builds a consistent 3D semantic scene graph of a scene given an RGB image sequence. Our method consists of a novel incremental entity estimation pipeline and a scene graph prediction network. The proposed pipeline simultaneously reconstructs a sparse point map and fuses entity estimates from the input images. The proposed network estimates 3D semantic scene graphs with iterative message passing using multi-view and geometric features extracted from the scene entities. Extensive experiments on the 3RScan dataset show the effectiveness of the proposed method in this challenging task, outperforming state-of-the-art approaches.
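To make the idea of iterative message passing over scene entities concrete, the following is a minimal sketch, not the proposed network. It assumes each entity already has a fixed-length feature vector (standing in for the multi-view and geometric features) and replaces the learned update with a plain average; the function name `message_passing` and the 0.5 mixing weight are hypothetical choices made only for illustration.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Edge = Tuple[int, int]  # (source node, target node)


def message_passing(
    node_feats: Dict[int, Vector],
    edges: List[Edge],
    num_iters: int = 2,
) -> Dict[int, Vector]:
    """Toy synchronous message passing on a directed graph.

    In every iteration, each node mixes its own feature with the mean
    feature of its incoming neighbours. A learned model would use trained
    update functions here; the fixed 0.5/0.5 mix is an assumption.
    """
    feats = {n: list(v) for n, v in node_feats.items()}
    for _ in range(num_iters):
        updated: Dict[int, Vector] = {}
        for node, own in feats.items():
            # Collect features of all nodes that point at this node.
            incoming = [feats[src] for src, dst in edges if dst == node]
            if not incoming:
                # Nodes without incoming edges keep their feature.
                updated[node] = list(own)
                continue
            dim = len(own)
            mean_msg = [sum(m[i] for m in incoming) / len(incoming) for i in range(dim)]
            updated[node] = [0.5 * own[i] + 0.5 * mean_msg[i] for i in range(dim)]
        # Swap in the new features only after all nodes are updated.
        feats = updated
    return feats
```

With two mutually connected entities and one isolated entity, a single iteration pulls the connected pair toward each other and leaves the isolated entity unchanged. In a scene graph setting, the refined node features would then feed object and relation classifiers, which this sketch omits.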