Semi-structured documents interleave diverse data elements (e.g., tables, charts, hierarchical paragraphs) in varied and often irregular layouts. Such documents appear across many domains and account for a large portion of real-world data. However, existing methods struggle to support natural language question answering over them because of three main technical challenges: (1) Elements extracted by techniques such as OCR are often fragmented and stripped of their original semantic context, making them inadequate for analysis. (2) Existing approaches lack effective representations to capture hierarchical structures within documents (e.g., associating tables with nested chapter titles) and to preserve layout-specific distinctions (e.g., differentiating sidebars from main content). (3) Answering questions often requires retrieving and aligning relevant information scattered across multiple regions or pages, such as linking a descriptive paragraph to table cells located elsewhere in the document. To address these challenges, we propose MoDora, an LLM-powered system for semi-structured document analysis. First, we adopt a local-alignment aggregation strategy to convert OCR-parsed elements into layout-aware components, and conduct type-specific information extraction for components with hierarchical titles or non-text elements. Second, we design the Component-Correlation Tree (CCTree) to organize components hierarchically, explicitly modeling inter-component relations and layout distinctions through a bottom-up cascade summarization process. Finally, we propose a question-type-aware retrieval strategy that supports (1) layout-based grid partitioning for location-based retrieval and (2) LLM-guided pruning for semantic-based retrieval. Experiments show that MoDora outperforms baselines by 5.97% to 61.07% in accuracy. The code is available at https://github.com/weAIDB/MoDora.
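To make the tree-based organization and the two retrieval modes more concrete, the following is a minimal Python sketch and not MoDora's actual implementation. All names (`Node`, `cascade_summarize`, `grid_cell`, `retrieve_by_location`, `retrieve_by_semantics`, `build_example`) are hypothetical, as is the 3x3 grid size. The `summarize` and `is_relevant` callables are plain-Python stand-ins for what would be LLM calls in a real system.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# (x0, y0, x1, y1) as fractions of page width/height, origin at top-left.
BBox = Tuple[float, float, float, float]


@dataclass
class Node:
    """One component in a hypothetical component tree (assumed schema)."""
    title: str
    kind: str = "text"            # e.g. "text", "table", "chart", "sidebar"
    text: str = ""
    page: int = 0
    bbox: Optional[BBox] = None   # None for purely structural nodes
    children: List["Node"] = field(default_factory=list)
    summary: str = ""


def truncate(text: str, limit: int = 80) -> str:
    """Stand-in for an LLM summarizer: keeps at most `limit` characters."""
    return text if len(text) <= limit else text[: limit - 3] + "..."


def cascade_summarize(node: Node, summarize: Callable[[str], str] = truncate) -> str:
    """Bottom-up pass: every child is summarized before its parent."""
    child_summaries = [cascade_summarize(c, summarize) for c in node.children]
    parts = [node.title, node.text, *child_summaries]
    node.summary = summarize(" | ".join(p for p in parts if p))
    return node.summary


def grid_cell(bbox: BBox, rows: int = 3, cols: int = 3) -> Tuple[int, int]:
    """Map a bounding box to the (row, col) grid cell containing its center."""
    cx = (bbox[0] + bbox[2]) / 2
    cy = (bbox[1] + bbox[3]) / 2
    return min(int(cy * rows), rows - 1), min(int(cx * cols), cols - 1)


def retrieve_by_location(root: Node, page: int, cell: Tuple[int, int],
                         rows: int = 3, cols: int = 3) -> List[Node]:
    """Location-based retrieval: components on `page` whose center lies in `cell`."""
    hits, stack = [], [root]
    while stack:
        node = stack.pop()
        if (node.bbox is not None and node.page == page
                and grid_cell(node.bbox, rows, cols) == cell):
            hits.append(node)
        stack.extend(reversed(node.children))  # keep document order
    return hits


def retrieve_by_semantics(root: Node, is_relevant: Callable[[str], bool]) -> List[Node]:
    """Top-down pruning: skip any subtree whose summary is judged irrelevant."""
    if not is_relevant(root.summary):
        return []
    hits: List[Node] = []
    for child in root.children:
        hits.extend(retrieve_by_semantics(child, is_relevant))
    # If no child is more specific, the node itself is the best match.
    return hits or [root]


def build_example() -> Node:
    """A tiny made-up document used only for illustration."""
    table = Node("Table 1", kind="table", text="Q1 10; Q2 12",
                 page=1, bbox=(0.1, 0.4, 0.6, 0.6))
    section = Node("1 Revenue", text="Revenue grew in 2023.",
                   page=1, bbox=(0.1, 0.1, 0.9, 0.3), children=[table])
    sidebar = Node("Sidebar", kind="sidebar", text="Contact info",
                   page=1, bbox=(0.8, 0.8, 0.95, 0.95))
    return Node("Annual Report", children=[section, sidebar])


if __name__ == "__main__":
    doc = build_example()
    cascade_summarize(doc)
    print([n.title for n in retrieve_by_location(doc, page=1, cell=(2, 2))])
    print([n.title for n in retrieve_by_semantics(doc, lambda s: "Q1" in s)])
```

Because each parent summary is built from its children's summaries, a relevance check at an inner node can rule out a whole subtree without visiting its leaves. This is the property that summary-guided pruning relies on in this sketch. The keyword test passed to `retrieve_by_semantics` is only a placeholder for a model-based relevance judgment.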