Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in varied and often irregular layouts. These documents are common across domains and account for a large portion of real-world data. However, existing methods struggle to support natural language question answering over these documents due to three main technical challenges: (1) The elements extracted by techniques like OCR are often fragmented and stripped of their original semantic context, making them inadequate for analysis. (2) Existing approaches lack effective representations to capture hierarchical structures within documents (e.g., associating tables with nested chapter titles) and to preserve layout-specific distinctions (e.g., differentiating sidebars from main content). (3) Answering questions often requires retrieving and aligning relevant information scattered across multiple regions or pages, such as linking a descriptive paragraph to table cells located elsewhere in the document. To address these issues, we propose MoDora, an LLM-powered system for semi-structured document analysis. First, we adopt a local-alignment aggregation strategy to convert OCR-parsed elements into layout-aware components, and conduct type-specific information extraction for components with hierarchical titles or non-text elements. Second, we design the Component-Correlation Tree (CCTree) to organize components hierarchically, explicitly modeling inter-component relations and layout distinctions through a bottom-up cascade summarization process. Finally, we propose a question-type-aware retrieval strategy that supports (1) layout-based grid partitioning for location-based retrieval and (2) LLM-guided pruning for semantic-based retrieval. Experiments show that MoDora outperforms baselines by 5.97% to 61.07% in accuracy. The code is available at https://github.com/weAIDB/MoDora.
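To make the three ideas in the abstract more concrete, the following is a minimal sketch, not MoDora's actual implementation. It assumes a hypothetical `Node` class for tree components, a hypothetical `toy_summarize` function standing in for an LLM summarization call, a 3x3 page grid, and a keyword predicate standing in for an LLM relevance judgment. All names and parameters here are illustrative assumptions and do not come from the MoDora repository.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Bounding box as (x0, y0, x1, y1), assumed normalized to [0, 1] per page.
BBox = Tuple[float, float, float, float]


@dataclass
class Node:
    """Hypothetical tree component: a titled region with optional children."""
    title: str
    kind: str = "text"  # e.g. "text", "table", "chart", "sidebar"
    content: str = ""
    bbox: BBox = (0.0, 0.0, 1.0, 1.0)
    page: int = 0
    children: List["Node"] = field(default_factory=list)
    summary: str = ""


def summarize_bottom_up(node: Node, summarize: Callable[[Node, List[str]], str]) -> str:
    """Summarize children first, then the parent from its children's summaries."""
    child_summaries = [summarize_bottom_up(c, summarize) for c in node.children]
    node.summary = summarize(node, child_summaries)
    return node.summary


def toy_summarize(node: Node, child_summaries: List[str]) -> str:
    """Stand-in for an LLM call: concatenates the title, kind, and child summaries."""
    own = f"{node.title} [{node.kind}]"
    if not child_summaries:
        return own
    return f"{own} > (" + "; ".join(child_summaries) + ")"


def grid_cell(bbox: BBox, rows: int = 3, cols: int = 3) -> Tuple[int, int]:
    """Map a bounding box to the (row, col) grid cell containing its center."""
    cx = (bbox[0] + bbox[2]) / 2
    cy = (bbox[1] + bbox[3]) / 2
    return (min(int(cy * rows), rows - 1), min(int(cx * cols), cols - 1))


def retrieve_by_location(root: Node, page: int, cell: Tuple[int, int]) -> List[Node]:
    """Location-based retrieval: leaf components whose center falls in a grid cell."""
    hits, stack = [], [root]
    while stack:
        n = stack.pop()
        if not n.children and n.page == page and grid_cell(n.bbox) == cell:
            hits.append(n)
        stack.extend(n.children)
    return hits


def retrieve_by_semantics(node: Node, is_relevant: Callable[[str], bool]) -> List[Node]:
    """Semantic retrieval with pruning: skip any subtree whose summary is irrelevant."""
    if not is_relevant(node.summary):
        return []
    if not node.children:
        return [node]
    hits: List[Node] = []
    for child in node.children:
        hits.extend(retrieve_by_semantics(child, is_relevant))
    return hits


# A tiny example document: one section holding a table and a sidebar note.
doc = Node("Report", children=[
    Node("1 Revenue", children=[
        Node("Table 1", kind="table", content="Q1: 10, Q2: 12",
             bbox=(0.1, 0.4, 0.6, 0.6)),
        Node("Note", kind="sidebar", content="Figures in USD millions",
             bbox=(0.75, 0.05, 0.95, 0.25)),
    ]),
])

summarize_bottom_up(doc, toy_summarize)
top_right = retrieve_by_location(doc, page=0, cell=(0, 2))
tables = retrieve_by_semantics(doc, lambda s: "table" in s)
print(doc.summary)
print([n.title for n in top_right], [n.title for n in tables])
```

In this sketch, a question such as "what is in the top-right corner of page 1?" would map to the location path, while a question about table contents would follow the pruning path, descending only into subtrees whose summaries pass the relevance check. In a real system the summarizer and the relevance check would presumably be LLM calls rather than string operations.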