Understanding multimodal long-context documents that comprise heterogeneous chunks such as paragraphs, figures, and tables is challenging due to (1) cross-modal heterogeneity, which makes it difficult to localize relevant information across modalities, and (2) cross-page reasoning, which requires aggregating evidence dispersed across pages. To address these challenges, we adopt a query-centric formulation that projects cross-modal and cross-page information into a unified query representation space, with queries acting as abstract semantic surrogates for heterogeneous multimodal content. In this paper, we propose a Multimodal Long-Context Document Retrieval Augmented Generation (MLDocRAG) framework that leverages a Multimodal Chunk-Query Graph (MCQG) to organize multimodal document content around semantically rich, answerable queries. MCQG is constructed via a multimodal document expansion process that generates fine-grained queries from heterogeneous document chunks and links them to their corresponding content across modalities and pages. This graph-based structure enables selective, query-centric retrieval and structured evidence aggregation, thereby enhancing grounding and coherence in multimodal long-context question answering. Experiments on the MMLongBench-Doc and LongDocURL datasets show that MLDocRAG consistently improves retrieval quality and answer accuracy, demonstrating its effectiveness for multimodal long-context understanding.
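To make the chunk-query graph and the query-centric retrieval it supports concrete, the sketch below shows one possible way to represent MCQG and aggregate evidence. It is a minimal illustration under stated assumptions, not the paper's implementation: all class and function names (`Chunk`, `MCQG`, `retrieve`, `_cosine`) are hypothetical, queries are supplied by the caller rather than generated by a model, and a toy bag-of-words similarity stands in for a multimodal embedding model.

```python
# Minimal sketch of a Multimodal Chunk-Query Graph (MCQG) and query-centric
# retrieval. Names, the scoring function, and the toy similarity are
# illustrative assumptions, not the paper's actual implementation.
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Chunk:
    """A multimodal document chunk (paragraph, figure, or table)."""
    chunk_id: str
    modality: str          # e.g. "text", "figure", "table"
    page: int
    content: str           # text, caption, or a textual rendering of a table


@dataclass
class MCQG:
    """Chunk-query graph: generated queries act as semantic surrogates for chunks."""
    chunks: dict = field(default_factory=dict)                       # chunk_id -> Chunk
    query_to_chunks: dict = field(default_factory=lambda: defaultdict(set))

    def add_chunk(self, chunk: Chunk, generated_queries: list[str]) -> None:
        # In the framework, document expansion would use a model to generate
        # fine-grained, answerable queries from each chunk; here the queries
        # are simply passed in by the caller.
        self.chunks[chunk.chunk_id] = chunk
        for q in generated_queries:
            self.query_to_chunks[q].add(chunk.chunk_id)

    def retrieve(self, user_query: str, top_k: int = 3) -> list[Chunk]:
        """Match the user query against generated queries, then aggregate the
        chunks linked to the best-matching queries, which may span multiple
        pages and modalities."""
        scored = sorted(
            self.query_to_chunks,
            key=lambda q: _cosine(user_query, q),
            reverse=True,
        )
        evidence: list[Chunk] = []
        for q in scored[:top_k]:
            for cid in self.query_to_chunks[q]:
                chunk = self.chunks[cid]
                if chunk not in evidence:
                    evidence.append(chunk)
        return evidence


def _cosine(a: str, b: str) -> float:
    """Toy bag-of-words cosine similarity; a real system would use a
    (multimodal) embedding model instead."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    graph = MCQG()
    graph.add_chunk(
        Chunk("c1", "table", page=4, content="Revenue by region, 2021-2023"),
        ["What was the revenue in Europe in 2022?"],
    )
    graph.add_chunk(
        Chunk("c2", "text", page=12, content="European revenue grew 8% in 2022."),
        ["How much did European revenue grow in 2022?"],
    )
    for chunk in graph.retrieve("How did revenue in Europe change in 2022?"):
        print(chunk.chunk_id, chunk.modality, "page", chunk.page)
```

The design choice illustrated here is that retrieval scores user queries against the generated surrogate queries rather than against raw heterogeneous chunks, and then follows graph edges to collect the linked chunks as evidence, which is how the query-centric formulation sidesteps direct cross-modal matching.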