3D scene graphs are an emerging 3D scene representation that models both the objects present in a scene and the relationships between them. Learning 3D scene graphs is challenging, however, because it requires not only object labels but also relationship annotations, which are scarce in existing datasets. Although pre-training is widely accepted as an effective way to improve model performance in low-data regimes, we find that existing pre-training methods are ill-suited for 3D scene graphs. To address this, we present the first language-based pre-training approach for 3D scene graphs, which exploits the strong relationship between scene graphs and language. Specifically, we leverage the language encoder of CLIP, a popular vision-language model, and distill its knowledge into our graph-based network. We formulate a contrastive pre-training objective that aligns text embeddings of relationships (subject-predicate-object triplets) with predicted 3D graph features. Our method achieves state-of-the-art results on the main semantic 3D scene graph benchmark, improving over pre-training baselines and outperforming all existing fully supervised scene graph prediction methods by a significant margin. Furthermore, because our scene graph features are language-aligned, we can query their language space in a zero-shot manner. As an example, we use this property to predict the room type of a scene without further training.
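To make the alignment idea concrete, the following is a minimal sketch of a contrastive objective between relationship features and triplet text embeddings, together with a zero-shot query by cosine similarity. It is an illustration under stated assumptions, not the loss used in the paper: the abstract does not specify the exact formulation, so a generic symmetric InfoNCE (CLIP-style) loss is assumed here, the temperature of 0.07 is an assumed default, and the function names `contrastive_alignment_loss` and `zero_shot_query` are hypothetical. Plain arrays stand in for the outputs of the graph network and of the CLIP language encoder.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_alignment_loss(graph_feats, text_embeds, temperature=0.07):
    """Symmetric InfoNCE loss (an assumed formulation, not the paper's).

    graph_feats: (N, D) predicted relationship features from the graph network.
    text_embeds: (N, D) text embeddings of the matching
                 subject-predicate-object triplets (row i pairs with row i).
    """
    g = l2_normalize(graph_feats)
    t = l2_normalize(text_embeds)
    logits = g @ t.T / temperature  # (N, N) scaled cosine similarities
    idx = np.arange(len(g))
    # Each graph feature should select its own triplet text, and vice versa.
    loss_g2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2g = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_g2t + loss_t2g)


def zero_shot_query(feature, prompt_embeds):
    """Return the index of the prompt embedding closest to a single feature.

    feature: (D,) language-aligned feature, e.g. a pooled scene feature.
    prompt_embeds: (K, D) text embeddings of candidate prompts,
                   e.g. one prompt per room type.
    """
    sims = l2_normalize(prompt_embeds) @ l2_normalize(feature)
    return int(np.argmax(sims))
```

Because the features live in the same space as the text embeddings after pre-training, a query such as room-type prediction reduces to a nearest-prompt lookup: embed one prompt per candidate room type with the same language encoder and pass the embeddings to `zero_shot_query`. How the scene-level feature is obtained from the graph is left open here, since the abstract does not describe it.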