Long-document multimodal question answering requires a system to locate sparse evidence in long PDFs and integrate clues from text, tables, images, charts, and complex layouts. Existing RAG methods mostly rely on fixed Top-k retrieval over text chunks or pages. Text retrieval compresses the context but often loses visual and layout information. Page-level visual retrieval preserves the original page but also sends large irrelevant regions to the reader. Both approaches are therefore locked into a static trade-off among evidence coverage, noise, and inference cost. This paper proposes MAGE-RAG, a multigranular adaptive graph evidence framework for long-document multimodal QA. MAGE-RAG uses page retrieval as the entry point for query-time evidence construction. Offline, it builds an evidence graph of page nodes and element nodes that encodes containment, reading order, layout adjacency, section hierarchy, and semantic-neighbor relations. At query time, an online evidence controller iteratively activates, opens, searches, and prunes evidence under explicit budgets. The resulting evidence subgraph is then rendered into structured multimodal reader input, so that the large vision-language model (LVLM) consumes compact, relevant evidence within a limited context. On LongDocURL and MMLongBench-Doc, we establish a unified comparison and analysis protocol covering Direct MLLM, Text RAG, Page-level Visual RAG, and Graph/Agentic RAG. Experiments show that MAGE-RAG achieves an overall accuracy of 52.75 on LongDocURL, and an accuracy of 53.26 with an F1 of 51.19 on MMLongBench-Doc. Fine-grained breakdowns, budget-performance curves, ablations, and trace-based analysis further show that query-time evidence subgraph construction can balance coverage of dispersed evidence against context-noise control. Our code is available at https://github.com/laonuo2004/MAGE-RAG.git.
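To make the query-time procedure concrete, the following is a minimal Python sketch of budgeted evidence construction over a page/element graph. It is not the released MAGE-RAG implementation: the names `Node`, `EvidenceGraph`, `score`, `build_evidence`, and `demo_graph`, the lexical-overlap relevance score, the unit node costs, and the `min_score` threshold are all assumptions introduced for illustration. The sketch only shows the general shape of the loop described above: open the retrieved entry pages into their elements, keep the relevant elements that fit the budget, and search along graph relations for further evidence.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    kind: str      # "page" or "element"
    text: str      # toy stand-in for multimodal content
    cost: int = 1  # hypothetical budget cost (e.g. reader tokens)


@dataclass
class EvidenceGraph:
    nodes: dict = field(default_factory=dict)
    edges: dict = field(default_factory=dict)  # node_id -> [(relation, node_id)]

    def add_node(self, node):
        self.nodes[node.node_id] = node
        self.edges.setdefault(node.node_id, [])

    def add_edge(self, src, relation, dst):
        # Relations are stored in both directions for simple traversal.
        self.edges[src].append((relation, dst))
        self.edges[dst].append((relation, src))

    def neighbors(self, node_id, relations=None):
        return [dst for rel, dst in self.edges[node_id]
                if relations is None or rel in relations]


def score(query, node):
    """Toy relevance (assumption): fraction of query words found in the node text."""
    q = set(query.lower().split())
    t = set(node.text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def build_evidence(graph, query, entry_pages, budget, max_steps=3, min_score=0.2):
    """Return {element_id: score} for the evidence selected under the budget."""
    # Activate and open: expand each retrieved entry page into its elements.
    frontier = set()
    for page in entry_pages:
        frontier.update(graph.neighbors(page, {"contains"}))

    selected = {}
    for _ in range(max_steps):
        # Prune: drop already-selected, non-element, and low-scoring candidates.
        candidates = sorted(
            ((score(query, graph.nodes[n]), n) for n in frontier
             if n not in selected and graph.nodes[n].kind == "element"),
            reverse=True,
        )
        spent = sum(graph.nodes[n].cost for n in selected)
        added = []
        for s, n in candidates:
            cost = graph.nodes[n].cost
            if s >= min_score and spent + cost <= budget:
                selected[n] = s
                spent += cost
                added.append(n)
        if not added:
            break
        # Search: follow any relation from the newly added elements.
        frontier = {m for n in added for m in graph.neighbors(n)}
    return selected


def demo_graph():
    """Two pages, three elements, one cross-page semantic-neighbor edge."""
    g = EvidenceGraph()
    g.add_node(Node("p1", "page", ""))
    g.add_node(Node("p2", "page", ""))
    g.add_node(Node("e1", "element", "revenue table 2021 total revenue 5.2 million"))
    g.add_node(Node("e2", "element", "company logo image"))
    g.add_node(Node("e3", "element", "figure caption revenue growth chart"))
    g.add_edge("p1", "contains", "e1")
    g.add_edge("p1", "contains", "e2")
    g.add_edge("p2", "contains", "e3")
    g.add_edge("e1", "semantic_neighbor", "e3")
    return g


if __name__ == "__main__":
    print(build_evidence(demo_graph(), "total revenue 2021", ["p1"], budget=2))
```

In this toy graph, only page `p1` is retrieved, yet the element `e3` on page `p2` can still be selected because it is reachable from `e1` through a semantic-neighbor edge, while the irrelevant element `e2` on the retrieved page is pruned. Lowering `budget` to 1 keeps only the highest-scoring element.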