Structure Guided Multi-modal Pre-trained Transformer for Knowledge Graph Reasoning

From arXiv. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Multimodal knowledge graphs (MKGs), which intuitively organize information in various modalities, can benefit many practical downstream tasks, such as recommendation systems and visual question answering. However, most MKGs are still far from complete, which has motivated a growing body of MKG reasoning models. Recently, with the development of general artificial intelligence architectures, pretrained transformer models have drawn increasing attention, especially for multimodal scenarios. However, research on multimodal pretrained transformers (MPTs) for knowledge graph reasoning (KGR) is still at an early stage. Rich structural information is the biggest difference between MKGs and other multimodal data, yet existing MPT models still cannot fully leverage it. Most of them use the graph structure only as a retrieval map for matching images and texts connected to the same entity, which limits their reasoning performance. To this end, we propose the graph Structure Guided Multimodal Pretrained Transformer for knowledge graph reasoning, termed SGMPT. Specifically, a graph structure encoder is adopted for structural feature encoding. Then a structure-guided fusion module with two different strategies, i.e., weighted summation and alignment constraint, is designed for the first time to inject the structural information into both the textual and visual features. To the best of our knowledge, SGMPT is the first MPT model for multimodal KGR, which mines the structural information underlying the knowledge graph. Extensive experiments on FB15k-237-IMG and WN18-IMG demonstrate that SGMPT outperforms existing state-of-the-art models and confirm the effectiveness of the designed strategies.
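
The abstract names the two fusion strategies but gives no formulas, so the following is only a minimal sketch of what "weighted summation" and "alignment constraint" could look like on plain feature vectors. The function names, the mixing weight `alpha`, and the choice of a cosine-based penalty are all assumptions made for illustration and are not taken from SGMPT itself.

```python
import math


def weighted_sum_fusion(modal_feat, struct_feat, alpha=0.5):
    """Blend a textual or visual feature vector with a structural one.

    `alpha` is a hypothetical mixing weight: 0 keeps only the modal
    feature, 1 keeps only the structural feature.
    """
    if len(modal_feat) != len(struct_feat):
        raise ValueError("feature vectors must have the same dimension")
    return [(1.0 - alpha) * m + alpha * s
            for m, s in zip(modal_feat, struct_feat)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def alignment_loss(modal_feat, struct_feat):
    """Assumed auxiliary penalty: 0 when the two vectors point the same
    way, growing toward 2 as they point in opposite directions."""
    return 1.0 - cosine_similarity(modal_feat, struct_feat)


# Toy vectors standing in for encoder outputs (made-up numbers).
text_feat = [1.0, 0.0, 2.0]
struct_feat = [0.0, 2.0, 2.0]

fused = weighted_sum_fusion(text_feat, struct_feat, alpha=0.25)
loss = alignment_loss(text_feat, struct_feat)
```

In this sketch the weighted sum changes the feature that is passed on to later layers, while the alignment term leaves the feature untouched and would instead be added to a training objective. That contrast is one plausible reading of why the abstract lists them as two different strategies.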
