In this paper, we propose the semantic graph Transformer (SGT) for 3D scene graph generation. The task aims to parse a point cloud-based scene into a semantic structural graph, and its core challenge is modeling the complex global structure. Existing methods based on graph convolutional networks (GCNs) suffer from over-smoothing and can only propagate information from a limited set of neighboring nodes. In contrast, SGT uses Transformer layers as its base building block to allow global information passing, and introduces two types of Transformer layers tailored to 3D scene graph generation. Specifically, we introduce the graph embedding layer to make the best use of the global information in graph edges while keeping computation costs comparable. Additionally, we propose the semantic injection layer to leverage categorical text labels and visual object knowledge. We evaluate SGT on the established 3DSSG benchmark and achieve a 35.9% absolute improvement in R@50 for relationship prediction, and an 80.40% boost on the subset with complex scenes, over the state of the art. Our analyses further show SGT's superiority in long-tailed and zero-shot scenarios. We will release the code and model.