Semi-structured documents integrate diverse, interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in varied and often irregular layouts. These documents are common across domains and account for a large portion of real-world data. However, existing methods struggle to support natural language question answering over these documents because of three main technical challenges: (1) The elements extracted by techniques such as OCR are often fragmented and stripped of their original semantic context, making them inadequate for analysis. (2) Existing approaches lack effective representations to capture hierarchical structures within documents (e.g., associating tables with nested chapter titles) and to preserve layout-specific distinctions (e.g., differentiating sidebars from main content). (3) Answering questions often requires retrieving and aligning relevant information scattered across multiple regions or pages, such as linking a descriptive paragraph to table cells located elsewhere in the document. To address these issues, we propose MoDora, an LLM-powered system for semi-structured document analysis. First, we adopt a local-alignment aggregation strategy to convert OCR-parsed elements into layout-aware components, and conduct type-specific information extraction for components with hierarchical titles or non-text elements. Second, we design the Component-Correlation Tree (CCTree) to hierarchically organize components, explicitly modeling inter-component relations and layout distinctions through a bottom-up cascade summarization process. Finally, we propose a question-type-aware retrieval strategy that supports (1) layout-based grid partitioning for location-based retrieval and (2) LLM-guided pruning for semantic-based retrieval. Experiments show that MoDora outperforms baselines by 5.97% to 61.07% in accuracy. The code is available at https://github.com/weAIDB/MoDora.
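To make the tree organization and location-based retrieval described above more concrete, the following is a minimal illustrative sketch, not MoDora's actual implementation. All names here (`Node`, `summarize_bottom_up`, `toy_summarizer`, `grid_cell`, `retrieve_by_location`) are hypothetical and do not come from the MoDora repository. The summarizer is a deterministic stand-in for an LLM call, bounding boxes are assumed to be normalized to [0, 1], and the grid is assumed to assign a component to the cell containing its center.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Tuple

# (x0, y0, x1, y1), assumed normalized to the page so every value lies in [0, 1].
BBox = Tuple[float, float, float, float]


@dataclass
class Node:
    """One component in a hypothetical component tree."""
    title: str
    text: str = ""
    bbox: BBox = (0.0, 0.0, 1.0, 1.0)
    children: List["Node"] = field(default_factory=list)
    summary: str = ""


Summarizer = Callable[[str, str, List[str]], str]


def summarize_bottom_up(node: Node, summarize: Summarizer) -> str:
    """Summarize children first, then the parent from its children's summaries."""
    child_summaries = [summarize_bottom_up(c, summarize) for c in node.children]
    node.summary = summarize(node.title, node.text, child_summaries)
    return node.summary


def toy_summarizer(title: str, text: str, child_summaries: List[str]) -> str:
    """Stand-in for an LLM call (assumption): nests child summaries under the title."""
    if not child_summaries:
        return title
    return f"{title} [{'; '.join(child_summaries)}]"


def grid_cell(bbox: BBox, rows: int, cols: int) -> Tuple[int, int]:
    """Map a bounding box to the (row, col) grid cell that contains its center."""
    cx = (bbox[0] + bbox[2]) / 2
    cy = (bbox[1] + bbox[3]) / 2
    # Clamp so a center exactly on the right or bottom edge stays inside the grid.
    return min(int(cy * rows), rows - 1), min(int(cx * cols), cols - 1)


def iter_leaves(node: Node) -> Iterator[Node]:
    """Yield leaf components in document order."""
    if not node.children:
        yield node
    for child in node.children:
        yield from iter_leaves(child)


def retrieve_by_location(root: Node, row: int, col: int,
                         rows: int = 3, cols: int = 3) -> List[Node]:
    """Return leaf components whose center falls in the requested grid cell."""
    return [n for n in iter_leaves(root)
            if grid_cell(n.bbox, rows, cols) == (row, col)]


if __name__ == "__main__":
    table = Node("Table 1", text="accuracy per dataset", bbox=(0.1, 0.1, 0.5, 0.3))
    sidebar = Node("Sidebar", text="notes", bbox=(0.8, 0.4, 0.98, 0.6))
    root = Node("Report", children=[Node("1 Results", children=[table, sidebar])])

    print(summarize_bottom_up(root, toy_summarizer))
    print([n.title for n in retrieve_by_location(root, row=0, col=0)])
```

In this sketch a question such as "what is in the top-left of the page?" maps to a grid cell lookup, while a semantic question would instead walk the tree and use the per-node summaries to decide which subtrees to skip. The pruning step itself is omitted because it depends on an LLM.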