Knowledge-based visual question answering (KVQA) has been extensively studied as the task of answering visual questions with external knowledge, e.g., knowledge graphs (KGs). While several attempts have been made to leverage large language models (LLMs) as an implicit knowledge source, this remains challenging because LLMs may hallucinate. Moreover, multiple knowledge sources, e.g., images, KGs, and LLMs, cannot be readily aligned in complex scenarios. To tackle these challenges, we present a novel modality-aware integration with LLMs for KVQA (MAIL). It leverages multimodal knowledge for both image understanding and knowledge reasoning. Specifically, (i) we propose a two-stage prompting strategy with LLMs to densely embody the image into a scene graph with detailed visual features; (ii) we construct a coupled concept graph by linking the mentioned entities with external facts; and (iii) we design a tailored pseudo-siamese graph medium fusion for sufficient multimodal fusion. We use the mentioned entities shared by the two graphs as mediums to bridge a tight inter-modal exchange, while maximally preserving insightful intra-modal learning by constraining the fusion to the mediums. Extensive experiments on two benchmark datasets show the superiority of MAIL while using 24x fewer resources.
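The idea of constraining cross-modal exchange to shared mediums can be illustrated with a toy sketch. This is not MAIL's implementation: the mean aggregation, the simple averaging at medium nodes, and every name below (`intra_modal_step`, `medium_fusion_step`, the dog/frisbee example) are assumptions chosen only to show the structure, namely intra-modal message passing inside each graph followed by an exchange restricted to entities that appear in both graphs.

```python
from typing import Dict, List, Set, Tuple

Vector = List[float]
Graph = Dict[str, Set[str]]      # node -> neighbours (undirected)
Features = Dict[str, Vector]     # node -> embedding


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def intra_modal_step(graph: Graph, feats: Features) -> Features:
    """One round of mean aggregation over each node and its neighbours.

    Assumption: a stand-in for whatever graph encoder is actually used.
    """
    return {
        node: mean([feats[node]] + [feats[n] for n in sorted(nbrs)])
        for node, nbrs in graph.items()
    }


def medium_fusion_step(
    scene: Features, concept: Features
) -> Tuple[Features, Features, Set[str]]:
    """Exchange information only at nodes present in both graphs.

    Assumption: fusion is a plain average of the two medium embeddings.
    Non-medium nodes are left untouched, so intra-modal features survive.
    """
    mediums = set(scene) & set(concept)
    fused_scene, fused_concept = dict(scene), dict(concept)
    for m in mediums:
        shared = mean([scene[m], concept[m]])
        fused_scene[m] = shared
        fused_concept[m] = shared
    return fused_scene, fused_concept, mediums


# Hypothetical toy data: "dog" is mentioned in both graphs, so it is the medium.
scene_graph: Graph = {"dog": {"frisbee"}, "frisbee": {"dog"}}
scene_feats: Features = {"dog": [1.0, 0.0], "frisbee": [0.0, 1.0]}
concept_graph: Graph = {"dog": {"mammal"}, "mammal": {"dog"}}
concept_feats: Features = {"dog": [0.0, 0.0], "mammal": [2.0, 2.0]}

scene_h = intra_modal_step(scene_graph, scene_feats)
concept_h = intra_modal_step(concept_graph, concept_feats)
scene_out, concept_out, mediums = medium_fusion_step(scene_h, concept_h)
```

In this sketch only the `dog` node receives information from the other modality; `frisbee` and `mammal` keep the embeddings produced by their own graph, which mirrors the stated goal of preserving intra-modal learning while still allowing an inter-modal exchange.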