Despite the success of Transformer models in vision and language tasks, they often learn knowledge implicitly from enormous amounts of data and cannot utilize structured input data directly. On the other hand, structured learning approaches such as graph neural networks (GNNs), which integrate prior information, can barely compete with Transformer models. In this work, we aim to get the best of both worlds and propose a novel Multimodal Graph Transformer for question answering tasks that require reasoning across multiple modalities. We introduce a graph-involved, plug-and-play quasi-attention mechanism that incorporates multimodal graph information, acquired from text and visual data, into vanilla self-attention as an effective prior. In particular, we construct a text graph, a dense region graph, and a semantic graph to generate adjacency matrices, and then compose them with the input vision and language features to perform downstream reasoning. Regularizing self-attention with graph information in this way significantly improves inference ability and helps align features from different modalities. We validate the effectiveness of the Multimodal Graph Transformer over its Transformer baselines on the GQA, VQAv2, and MultiModalQA datasets.
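To make the idea of using an adjacency matrix as a prior on self-attention more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how the graph is composed with attention, so the sketch assumes one simple option: adding a scaled adjacency matrix to the attention logits before the softmax. The function name `graph_biased_attention`, the `weight` parameter, and the toy graph are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def graph_biased_attention(q, k, v, adjacency, weight=1.0):
    """Scaled dot-product attention with an additive graph prior.

    Assumption (not taken from the paper): the adjacency matrix is added
    to the attention logits, scaled by `weight`.

    q, k, v:    arrays of shape (n, d)
    adjacency:  array of shape (n, n), nonzero where two tokens are linked
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)          # vanilla self-attention scores
    logits = logits + weight * adjacency   # graph information as a prior
    attn = softmax(logits, axis=-1)        # each row sums to 1
    return attn @ v, attn


# Toy example: 4 tokens, with a single undirected edge between tokens 0 and 1.
rng = np.random.default_rng(0)
n, d = 4, 8
x = rng.normal(size=(n, d))

adjacency = np.eye(n)                      # self-loops
adjacency[0, 1] = adjacency[1, 0] = 1.0    # hypothetical edge

out, attn = graph_biased_attention(x, x, x, adjacency, weight=2.0)
_, attn_plain = graph_biased_attention(x, x, x, np.zeros((n, n)))
```

Under this assumption, token pairs connected in the graph receive more attention mass than they would under vanilla self-attention, and unconnected pairs receive less. With `weight=0` or an all-zero adjacency matrix, the function reduces to standard scaled dot-product attention.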