In this paper, we propose SGFormer, a Semantic Graph TransFormer for point cloud-based 3D scene graph generation. The task aims to parse a point cloud-based scene into a semantic structural graph, and its core challenge is modeling the complex global structure. Existing methods based on graph convolutional networks (GCNs) suffer from over-smoothing and can only propagate information from a limited set of neighboring nodes. In contrast, SGFormer uses Transformer layers as its base building block to allow global information passing, together with two newly designed layer types tailored to 3D scene graph generation. Specifically, we introduce a graph embedding layer that makes full use of the global information in graph edges while keeping the computational cost comparable. Furthermore, we propose a semantic injection layer that leverages linguistic knowledge from a large language model (i.e., ChatGPT) to enhance objects' visual features. We benchmark SGFormer on the established 3DSSG dataset and achieve a 40.94% absolute improvement in R@50 for relationship prediction, and an 88.36% boost on the subset with complex scenes, over the state of the art. Our analyses further show SGFormer's superiority in long-tail and zero-shot scenarios. Our source code is available at https://github.com/Andy20178/SGFormer.
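The R@50 figure reported above is a Recall@K metric. As an illustration only, the sketch below shows one common way Recall@K is computed for scene graph triplets: rank the predicted (subject, predicate, object) triplets by confidence and measure the fraction of ground-truth triplets found among the top K. The function name `recall_at_k`, the triplet encoding, and the toy scores are assumptions made for this example; the evaluation protocol actually used on 3DSSG may differ in its details.

```python
from typing import List, Set, Tuple

# Hypothetical encoding: (subject node id, predicate label, object node id).
Triplet = Tuple[int, str, int]


def recall_at_k(
    scored_predictions: List[Tuple[Triplet, float]],
    ground_truth: List[Triplet],
    k: int,
) -> float:
    """Fraction of ground-truth triplets that appear in the top-k predictions."""
    gt: Set[Triplet] = set(ground_truth)
    if not gt:
        raise ValueError("ground_truth must contain at least one triplet")
    # Rank predictions from most to least confident, then keep the top k.
    ranked = sorted(scored_predictions, key=lambda item: item[1], reverse=True)
    top_k = {triplet for triplet, _score in ranked[:k]}
    hits = len(gt & top_k)
    return hits / len(gt)


# Toy scene with three predicted relationships and two annotated ones.
predictions = [
    ((0, "standing on", 1), 0.9),
    ((1, "close by", 2), 0.2),
    ((2, "lying on", 3), 0.7),
]
annotated = [(0, "standing on", 1), (1, "close by", 2)]

recall_top2 = recall_at_k(predictions, annotated, k=2)
recall_top3 = recall_at_k(predictions, annotated, k=3)
```

With this toy data, only one of the two annotated triplets is ranked in the top two, so `recall_top2` is 0.5, while `recall_top3` is 1.0 because all three predictions are kept.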