Retrieval-augmented generation (RAG) enables large language models (LLMs) to dynamically access external information, which is powerful for answering questions over previously unseen documents. Nonetheless, these models struggle with high-level conceptual understanding and holistic comprehension because limited context windows constrain their ability to reason deeply over long-form, domain-specific content such as full-length books. To address this problem, knowledge graphs (KGs) have been used to provide entity-centric structure and hierarchical summaries, offering more structured support for reasoning. However, existing KG-based RAG solutions remain restricted to text-only inputs and fail to exploit the complementary insights provided by other modalities such as vision. At the same time, reasoning over visual documents requires integrating textual, visual, and spatial cues into structured, hierarchical concepts. To close this gap, we introduce a multimodal knowledge graph-based RAG approach that enables cross-modal reasoning for better content understanding. Our method incorporates visual cues into knowledge graph construction, the retrieval phase, and the answer generation process. Experimental results on both global and fine-grained question answering tasks show that our approach consistently outperforms existing RAG-based approaches on both textual and multimodal corpora.