The multi-modal long-context document question-answering task aims to locate and integrate multi-modal evidence (such as text, tables, charts, images, and layouts) distributed across multiple pages for question understanding and answer generation. Existing methods fall into two categories: Large Vision-Language Model (LVLM)-based and Retrieval-Augmented Generation (RAG)-based methods. However, the former are susceptible to hallucinations, while the latter struggle with inter-modal disconnection and cross-page fragmentation. To address these challenges, a novel multi-modal RAG model, named MHier-RAG, is proposed; it leverages both textual and visual information across long-range pages to facilitate accurate question answering over visually rich documents. A hierarchical indexing method that integrates flattened in-page chunks with topological cross-page chunks is designed to jointly establish in-page multi-modal associations and long-distance cross-page dependencies. Building on joint similarity evaluation and large language model (LLM)-based re-ranking, a multi-granularity semantic retrieval method, comprising page-level parent-page retrieval and document-level summary retrieval, is proposed to foster multi-modal evidence connection and long-distance evidence integration and reasoning. Experiments on the public MMLongBench-Doc and LongDocURL datasets demonstrate the superiority of MHier-RAG in understanding and answering questions over modality-rich, multi-page documents.
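To make the described pipeline concrete, the following is a minimal Python sketch, under stated assumptions, of the retrieval flow summarized above: hierarchical indexing over flattened in-page chunks and cross-page summary chunks, joint similarity scoring across both granularities, LLM-based re-ranking (stubbed out here), and parent-page expansion of the top hits. Every name (Chunk, embed, build_hierarchical_index, retrieve), the toy character-code embedder, and the pairwise page summaries are illustrative placeholders, not the authors' implementation; in particular, the paper's topological cross-page chunks link long-distance pages, which this sketch approximates with adjacent-page pairs.

# Minimal sketch of an MHier-RAG-style retrieval pipeline; all helper
# names and the toy embedder are hypothetical, not the authors' code.
from dataclasses import dataclass, field


@dataclass
class Chunk:
    page: int                 # source page index
    modality: str             # "text" | "table" | "chart" | "image" | "summary"
    content: str
    vector: list[float] = field(default_factory=list)


def embed(text: str) -> list[float]:
    # Toy fixed-length embedder for illustration; a real system would
    # use a multi-modal encoder here.
    return [float(ord(c)) for c in text[:32].ljust(32)]


def cosine(a: list[float], b: list[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0


def build_hierarchical_index(pages: dict[int, list[Chunk]]):
    """Flattened in-page chunks plus cross-page summary chunks."""
    in_page = [c for chunks in pages.values() for c in chunks]
    for c in in_page:
        c.vector = embed(c.content)
    # Cross-page chunks: naive pairwise page summaries stand in for the
    # paper's topological links between distant pages.
    cross_page = []
    ids = sorted(pages)
    for i, j in zip(ids, ids[1:]):
        text = " ".join(c.content for c in pages[i] + pages[j])[:200]
        cross_page.append(Chunk(i, "summary", text, embed(text)))
    return in_page, cross_page


def retrieve(query: str, in_page, cross_page, pages, k: int = 3):
    qv = embed(query)
    # 1) Joint similarity evaluation over both granularities.
    pool = sorted(in_page + cross_page,
                  key=lambda c: cosine(qv, c.vector), reverse=True)[: 2 * k]
    # 2) LLM-based re-ranking, stubbed as a pass-through truncation here.
    reranked = pool[:k]
    # 3) Parent-page retrieval: expand each hit to all chunks of its page,
    #    restoring in-page multi-modal context (text, tables, figures).
    evidence, seen = [], set()
    for c in reranked:
        if c.page not in seen:
            seen.add(c.page)
            evidence.extend(pages[c.page])
    return evidence


pages = {
    1: [Chunk(1, "text", "Revenue grew 12% year over year.")],
    2: [Chunk(2, "table", "Q1 revenue: 4.1M; Q2 revenue: 4.6M.")],
    3: [Chunk(3, "chart", "Bar chart of quarterly revenue trend.")],
}
in_page, cross_page = build_hierarchical_index(pages)
for c in retrieve("How did revenue change across quarters?", in_page, cross_page, pages):
    print(c.page, c.modality, c.content)

The design point the sketch illustrates is that similarity search only selects chunks; evidence is then assembled from whole parent pages and summaries, so co-located tables, charts, and text re-enter the generation context together.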