Existing multimodal document question-answering (QA) systems predominantly rely on flat semantic retrieval, representing documents as a set of disconnected text chunks and largely neglecting their intrinsic hierarchical and relational structures. Such flattening disrupts the logical and spatial dependencies (such as section organization, figure-text correspondence, and cross-reference relations) that humans naturally exploit for comprehension. To address this limitation, we introduce the Document MAP (DMAP), a document-level structural representation that explicitly encodes both hierarchical organization and inter-element relationships within multimodal documents. Specifically, we design a Structured-Semantic Understanding Agent that constructs DMAP by organizing textual content together with figures, tables, charts, and other elements into a human-aligned hierarchical schema that captures both semantic and layout dependencies. Building upon this representation, a Reflective Reasoning Agent performs structure-aware, evidence-driven reasoning, dynamically assessing the sufficiency of retrieved context and iteratively refining answers through targeted interactions with DMAP. Extensive experiments on MMDocQA benchmarks demonstrate that DMAP yields document-specific structural representations aligned with human interpretive patterns, substantially improving retrieval precision, reasoning consistency, and multimodal comprehension over conventional RAG-based approaches. Code is available at https://github.com/Forlorin/DMAP
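To make the two ideas in the abstract concrete, the following is a minimal sketch of a hierarchical document map with cross-reference links and a reflective retrieval loop over it. It is not the authors' implementation: the `Node` class, the `retrieve`, `expand`, and `reflective_answer` functions, the keyword-overlap scoring, and the `needs_figure` sufficiency check are all hypothetical stand-ins chosen for illustration. In a real system, retrieval and the sufficiency judgment would presumably be handled by embedding models and an LLM.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    """One element of a hypothetical document map (section, paragraph, figure, ...)."""
    node_id: str
    kind: str
    text: str
    children: list[Node] = field(default_factory=list)
    refs: list[str] = field(default_factory=list)  # ids of cross-referenced elements


def index_nodes(root: Node) -> dict[str, Node]:
    """Flatten the tree into an id -> node lookup (pre-order)."""
    out = {root.node_id: root}
    for child in root.children:
        out.update(index_nodes(child))
    return out


def score(query: str, node: Node) -> int:
    """Toy relevance (an assumption): count query words present in the node text."""
    return len(set(query.lower().split()) & set(node.text.lower().split()))


def retrieve(query: str, root: Node, top_k: int = 1) -> list[Node]:
    """Return up to top_k nodes with a non-zero toy relevance score."""
    ranked = sorted(index_nodes(root).values(), key=lambda n: score(query, n), reverse=True)
    return [n for n in ranked[:top_k] if score(query, n) > 0]


def expand(evidence: list[Node], root: Node) -> list[Node]:
    """Follow cross-reference links from the current evidence to related elements."""
    lookup = index_nodes(root)
    seen = {n.node_id for n in evidence}
    extra = []
    for node in evidence:
        for ref in node.refs:
            if ref in lookup and ref not in seen:
                seen.add(ref)
                extra.append(lookup[ref])
    return evidence + extra


def reflective_answer(
    query: str,
    root: Node,
    is_sufficient: Callable[[list[Node]], bool],
    max_rounds: int = 3,
) -> list[Node]:
    """Retrieve, check sufficiency, and expand through the map until satisfied."""
    evidence = retrieve(query, root)
    for _ in range(max_rounds):
        if is_sufficient(evidence):
            break
        expanded = expand(evidence, root)
        if len(expanded) == len(evidence):  # nothing new to add, stop early
            break
        evidence = expanded
    return evidence


# A tiny example document: one section holding a paragraph that cites a figure.
doc = Node("root", "document", "example report", children=[
    Node("s1", "section", "Results", children=[
        Node("p1", "paragraph", "accuracy improves as shown in the figure", refs=["f1"]),
        Node("f1", "figure", "bar chart comparing baseline and proposed method"),
    ]),
])


def needs_figure(evidence: list[Node]) -> bool:
    """Hypothetical sufficiency check: the evidence must include a figure."""
    return any(n.kind == "figure" for n in evidence)


evidence = reflective_answer("how does accuracy change", doc, needs_figure)
print([n.node_id for n in evidence])
```

In this toy run, flat keyword retrieval alone finds only the paragraph. The figure becomes reachable because the map records that the paragraph references it, which is the kind of figure-text correspondence a flat chunk store would lose.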