Document structure reconstruction is the problem of converting digital or scanned documents into their corresponding semantic structures. Most existing work focuses on detecting the boundary of each element within a single document page and neglects the reconstruction of semantic structure across multi-page documents. This paper introduces hierarchical reconstruction of document structures as a novel task relevant to both the NLP and CV fields. To better evaluate system performance on the new task, we build a large-scale dataset named HRDoc, which consists of 2,500 multi-page documents with nearly 2 million semantic units. Every document in HRDoc has line-level annotations, including categories and relations, obtained from rule-based extractors and human annotators. Moreover, we propose an encoder-decoder-based hierarchical document structure parsing system (DSPS) to tackle this problem. By adopting a multi-modal bidirectional encoder and a structure-aware GRU decoder with a soft-mask operation, the DSPS model surpasses the baseline method by a large margin. All scripts and datasets will be made publicly available at https://github.com/jfma-USTC/HRDoc.
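To make the idea of line-level annotations with categories and relations more concrete, the following is a minimal sketch of how such annotations could be assembled into a document hierarchy. It is an illustration only and not the DSPS model or the HRDoc file format: the field names (`line_id`, `category`, `parent_id`), the category labels, and the helper functions `build_tree` and `outline` are all hypothetical, and the sketch assumes each line stores a pointer to a single parent line.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Line:
    """One annotated text line (hypothetical schema, not HRDoc's actual format)."""
    line_id: int
    text: str
    category: str             # e.g. "title", "section", "paragraph" (assumed labels)
    parent_id: Optional[int]  # assumed convention: None marks a top-level unit


@dataclass
class Node:
    """A tree node wrapping one line and its child nodes."""
    line: Line
    children: List["Node"] = field(default_factory=list)


def build_tree(lines: List[Line]) -> List[Node]:
    """Link every line to its parent and return the top-level nodes."""
    nodes: Dict[int, Node] = {ln.line_id: Node(ln) for ln in lines}
    roots: List[Node] = []
    for ln in lines:
        node = nodes[ln.line_id]
        if ln.parent_id is None:
            roots.append(node)
        else:
            nodes[ln.parent_id].children.append(node)
    return roots


def outline(roots: List[Node], depth: int = 0) -> List[str]:
    """Render the hierarchy as indented strings, two spaces per level."""
    rows: List[str] = []
    for node in roots:
        rows.append("  " * depth + f"[{node.line.category}] {node.line.text}")
        rows.extend(outline(node.children, depth + 1))
    return rows


# Toy input spanning what could be several pages of one document.
EXAMPLE = [
    Line(0, "HRDoc", "title", None),
    Line(1, "1 Introduction", "section", 0),
    Line(2, "First paragraph line", "paragraph", 1),
    Line(3, "2 Dataset", "section", 0),
]

if __name__ == "__main__":
    print("\n".join(outline(build_tree(EXAMPLE))))
```

Because each line points to a parent by identifier rather than by page position, a structure of this kind can span page breaks, which is the property that distinguishes the multi-page task from single-page layout analysis.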