Question answering over visually rich documents (VRDs) requires reasoning not only over isolated content but also over a document's structural organization and cross-page dependencies. However, conventional retrieval-augmented generation (RAG) methods encode content as isolated chunks during ingestion, losing structural and cross-page dependencies, and retrieve a fixed number of pages at inference time, regardless of the specific demands of the question or context. This often results in incomplete evidence retrieval and degraded answer quality on multi-page reasoning tasks. To address these limitations, we propose LAD-RAG, a novel Layout-Aware Dynamic RAG framework. During ingestion, LAD-RAG constructs a symbolic document graph that captures layout structure and cross-page dependencies, storing it alongside standard neural embeddings to yield a more holistic representation of the document. During inference, an LLM agent dynamically interacts with the neural and symbolic indices to adaptively retrieve the evidence each query requires. Experiments on MMLongBench-Doc, LongDocURL, DUDE, and MP-DocVQA demonstrate that LAD-RAG improves retrieval, achieving over 90% perfect recall on average without any top-k tuning and outperforming baseline retrievers by up to 20% in recall at comparable noise levels, which translates into higher QA accuracy with minimal latency.
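The two-index design described above can be illustrated with a minimal sketch. All names here are hypothetical and the "neural" scorer is faked with keyword overlap; this is not the authors' implementation, only an illustration of seeding retrieval from an embedding-style index and then expanding along symbolic layout/cross-page edges instead of returning a fixed top-k.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a symbolic document graph built at ingestion
# (nodes = layout elements, edges = structural / cross-page links),
# plus a stand-in for a neural index. Illustrative only.

@dataclass
class Node:
    node_id: str
    page: int
    text: str
    neighbors: list = field(default_factory=list)  # structural links

class DocumentGraph:
    def __init__(self):
        self.nodes = {}

    def add_node(self, node_id, page, text):
        self.nodes[node_id] = Node(node_id, page, text)

    def add_edge(self, a, b):
        # undirected layout / cross-page edge
        self.nodes[a].neighbors.append(b)
        self.nodes[b].neighbors.append(a)

def neural_hits(graph, query, k=1):
    # stand-in for embedding similarity: keyword overlap with node text
    q = set(query.lower().split())
    scored = sorted(graph.nodes.values(),
                    key=lambda n: -len(q & set(n.text.lower().split())))
    return [n.node_id for n in scored[:k]]

def dynamic_retrieve(graph, query, max_hops=1):
    # seed with neural hits, then expand along symbolic edges
    # until the hop budget is exhausted (an agent would decide this adaptively)
    evidence = set(neural_hits(graph, query))
    frontier = set(evidence)
    for _ in range(max_hops):
        frontier = {nb for nid in frontier
                    for nb in graph.nodes[nid].neighbors} - evidence
        evidence |= frontier
    return sorted(evidence)

g = DocumentGraph()
g.add_node("p1_table", page=1, text="revenue table 2023")
g.add_node("p2_caption", page=2, text="table continued on next page")
g.add_node("p3_notes", page=3, text="footnotes for revenue figures")
g.add_edge("p1_table", "p2_caption")   # table spans pages 1-2
g.add_edge("p2_caption", "p3_notes")   # footnotes reference the table

print(dynamic_retrieve(g, "revenue table"))
# → ['p1_table', 'p2_caption']
```

A purely neural top-1 retriever would return only `p1_table` and miss the continuation on page 2; the graph expansion recovers it, and raising `max_hops` also pulls in the linked footnotes, mimicking query-adaptive evidence collection.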