Knowledge-based visual question answering (KVQA) has been extensively studied as the task of answering visual questions with external knowledge, e.g., knowledge graphs (KGs). Although several approaches leverage large language models (LLMs) as an implicit knowledge source, this remains challenging because LLMs may hallucinate. Moreover, multiple knowledge sources, e.g., images, KGs, and LLMs, cannot be readily aligned in complex scenarios. To tackle these issues, we present MAIL, a novel modality-aware integration with LLMs for KVQA. It leverages multimodal knowledge for both image understanding and knowledge reasoning. Specifically, (i) we propose a two-stage prompting strategy with LLMs to densely embody the image into a scene graph with detailed visual features; (ii) we construct a coupled concept graph by linking the mentioned entities with external facts; and (iii) we design a tailored pseudo-siamese graph medium fusion for sufficient multimodal fusion. We use the mentioned entities shared by the two graphs as mediums to bridge a tight inter-modal exchange, while maximally preserving insightful intra-modal learning by constraining the fusion within the mediums. Extensive experiments on two benchmark datasets show the superiority of MAIL with 24× fewer resources.