Visual question answering (VQA) requires systems to perform concept-level reasoning by unifying unstructured multimodal knowledge (e.g., the context in the question and answer; the "QA context") with structured multimodal knowledge (e.g., the knowledge graph for the QA context and scene; the "concept graph"). Existing work typically combines a scene graph and a concept graph of the scene by connecting corresponding visual nodes and concept nodes, then incorporates the QA context representation to perform question answering. However, these methods perform only a unidirectional fusion from unstructured knowledge to structured knowledge, which limits their ability to reason jointly over the heterogeneous modalities of knowledge. To perform more expressive reasoning, we propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations. Specifically, we interconnect the scene graph and the concept graph through a super node that represents the QA context, and we introduce a new multimodal GNN technique that performs inter-modal message passing for reasoning and mitigates representational gaps between modalities. On two challenging VQA tasks (VCR and GQA), our method outperforms strong baseline VQA methods by 3.2% on VCR (Q-AR) and 4.6% on GQA, suggesting its strength in concept-level reasoning. Ablation studies further demonstrate the efficacy of the bidirectional fusion and the multimodal GNN method in unifying unstructured and structured multimodal knowledge.
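The super-node construction described in the abstract can be illustrated with a toy example. The sketch below is not the VQA-GNN implementation: the function names, the node layout, the choice to connect the super node to every scene and concept node, and the plain mean aggregation are all assumptions made for illustration. It only shows how information can flow in both directions between a scene graph and a concept graph once both are attached to a shared QA-context node.

```python
import numpy as np

def build_joint_adjacency(scene_edges, concept_edges, n_scene, n_concept):
    """Build one adjacency matrix over scene nodes, concept nodes, and a super node.

    Hypothetical layout: indices [0, n_scene) are scene-graph nodes,
    [n_scene, n_scene + n_concept) are concept-graph nodes, and the last
    index is the QA-context super node. adj[i, j] = 1 means i receives from j.
    """
    n = n_scene + n_concept + 1
    z = n - 1  # index of the QA-context super node
    adj = np.zeros((n, n), dtype=float)
    for i, j in scene_edges:
        adj[i, j] = adj[j, i] = 1.0
    for i, j in concept_edges:
        adj[n_scene + i, n_scene + j] = 1.0
        adj[n_scene + j, n_scene + i] = 1.0
    # Assumption: the super node is linked to every other node, in both
    # directions, so structured nodes update it and it updates them.
    adj[z, :z] = 1.0
    adj[:z, z] = 1.0
    return adj

def message_passing_step(h, adj):
    """One round of mean aggregation over neighbours plus a self-loop."""
    a = adj + np.eye(adj.shape[0])
    deg = a.sum(axis=1, keepdims=True)
    return (a @ h) / deg

# Toy graph: two scene nodes, two concept nodes, one super node.
adj = build_joint_adjacency(
    scene_edges=[(0, 1)], concept_edges=[(0, 1)], n_scene=2, n_concept=2
)
h0 = np.eye(5)                       # one-hot features make the flow easy to trace
h1 = message_passing_step(h0, adj)   # scene nodes see the super node only
h2 = message_passing_step(h1, adj)   # concept information reaches scene nodes
```

With one-hot inputs, column 2 of `h1` and `h2` tracks how much of the first concept node's feature each node holds. Scene node 0 holds none of it after one step and a nonzero share after two, because the only path between the two graphs in this toy setup runs through the super node.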