In the field of 3D scene understanding, 3D scene graphs have emerged as a new scene representation that combines geometric and semantic information about objects and their relationships. However, learning semantic 3D scene graphs in a fully supervised manner is inherently difficult, as it requires not only object-level annotations but also relationship labels. While pre-training approaches have boosted the performance of many methods in various fields, pre-training for 3D scene graph prediction has received little attention. Furthermore, we find in this paper that classical contrastive point cloud-based pre-training approaches are ineffective for 3D scene graph learning. To address this, we present SGRec3D, a novel self-supervised pre-training method for 3D scene graph prediction. We propose to reconstruct the 3D input scene from a graph bottleneck as a pretext task. Pre-training SGRec3D does not require object relationship labels, making it possible to exploit large-scale 3D scene understanding datasets that were previously off-limits for 3D scene graph learning. Our experiments demonstrate that, in contrast to recent point cloud-based pre-training approaches, our proposed pre-training considerably improves 3D scene graph prediction, resulting in state-of-the-art performance and outperforming other 3D scene graph models by +10% on object prediction and +4% on relationship prediction. Additionally, we show that fine-tuning on only a small subset of 10% of the labeled data is sufficient to outperform the same model without pre-training.