Table of contents (ToC) extraction centres on structuring documents in a hierarchical manner. In this paper, we introduce a new dataset, ESGDoc, comprising 1,093 ESG annual reports from 563 companies, spanning 2001 to 2022. These reports pose significant challenges due to their diverse structures and extensive length. To address these challenges, we propose a new framework for ToC extraction, consisting of three steps: (1) constructing an initial tree of text blocks based on reading order and font sizes; (2) modelling each tree node (or text block) independently by considering the contextual information captured in its node-centric subtree; (3) modifying the original tree by taking an appropriate action on each tree node (Keep, Delete, or Move). This construction-modelling-modification (CMM) process offers several benefits. It eliminates the need for pairwise modelling of section headings as in previous approaches, making document segmentation practically feasible. By incorporating structured information, each section heading can leverage both the local and the long-distance context relevant to itself. Experimental results show that our approach outperforms the previous state-of-the-art baseline in a fraction of the running time. Our framework demonstrates its scalability by effectively handling documents of any length.
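The three CMM steps can be illustrated with a minimal Python sketch. It is not the paper's implementation, and every name in it (`Node`, `build_initial_tree`, `modify_tree`, `outline`, the `decide` callback) is hypothetical. It assumes that the initial tree attaches each text block to the nearest preceding block with a strictly larger font size, that Delete promotes a node's children to its parent, and that Move lifts a node one level up to become the next sibling of its parent while keeping reading order. The modelling step is replaced by a hand-written lookup standing in for a learned per-node classifier.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass(eq=False)  # identity comparison, so list.index() never recurses
class Node:
    text: str
    font_size: float
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)


def build_initial_tree(blocks: List[Tuple[str, float]]) -> Node:
    """Step 1 (assumed rule): blocks arrive in reading order; each one is
    attached to the nearest preceding block with a strictly larger font."""
    root = Node("ROOT", float("inf"))
    stack = [root]
    for text, size in blocks:
        while stack[-1].font_size <= size:  # root is never popped (inf)
            stack.pop()
        node = Node(text, size, parent=stack[-1])
        stack[-1].children.append(node)
        stack.append(node)
    return root


def preorder(node: Node):
    """Yield all descendants of `node` in reading order."""
    for child in node.children:
        yield child
        yield from preorder(child)


def modify_tree(root: Node, decide: Callable[[Node], str]) -> Node:
    """Steps 2 and 3: `decide` stands in for the per-node model; it is
    called once per node on the unmodified tree, then actions are applied."""
    actions = [(n, decide(n)) for n in preorder(root)]
    for node, action in actions:
        parent = node.parent
        idx = parent.children.index(node)
        if action == "Delete":
            # Assumption: children of a deleted node are promoted in place.
            for child in node.children:
                child.parent = parent
            parent.children[idx:idx + 1] = node.children
        elif action == "Move" and parent.parent is not None:
            # Assumption: the node becomes the next sibling of its parent;
            # its later siblings follow it as children to keep reading order.
            later = parent.children[idx + 1:]
            parent.children = parent.children[:idx]
            for sibling in later:
                sibling.parent = node
            node.children.extend(later)
            grand = parent.parent
            grand.children.insert(grand.children.index(parent) + 1, node)
            node.parent = grand
        # "Keep" (or a Move at the top level) leaves the node untouched.
    return root


def outline(node: Node, depth: int = 0) -> List[Tuple[int, str]]:
    """Flatten the tree into (depth, text) rows in reading order."""
    rows = []
    for child in node.children:
        rows.append((depth, child.text))
        rows.extend(outline(child, depth + 1))
    return rows


# Toy input: (text, font size) pairs in reading order.
blocks = [("Environment", 18.0), ("Emissions", 14.0),
          ("www.example.com", 14.0), ("Social", 14.0), ("Employees", 12.0)]
# Hand-written stand-in for the learned classifier (hypothetical labels).
labels = {"www.example.com": "Delete", "Social": "Move"}
toc = outline(modify_tree(build_initial_tree(blocks),
                          lambda n: labels.get(n.text, "Keep")))
# toc == [(0, "Environment"), (1, "Emissions"), (0, "Social"), (1, "Employees")]
```

Because each node receives a single decision rather than being compared with every other heading, the number of model calls in this sketch grows linearly with the number of text blocks, which is the property the abstract attributes to avoiding pairwise modelling.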