Document structure analysis (also known as document layout analysis) is crucial for understanding the physical layout and logical structure of documents, with applications in information retrieval, document summarization, and knowledge extraction. In this paper, we concentrate on Hierarchical Document Structure Analysis (HDSA), which explores hierarchical relationships within structured documents created with authoring software that employs hierarchical schemas, such as LaTeX, Microsoft Word, and HTML. To analyze hierarchical document structures comprehensively, we propose a tree-construction-based approach that addresses multiple subtasks concurrently: page object detection (Detect), reading order prediction of the detected objects (Order), and construction of the intended hierarchical structure (Construct). We present an effective end-to-end solution based on this framework to demonstrate its performance. To assess our approach, we develop a comprehensive benchmark, Comp-HRDoc, which evaluates the above subtasks simultaneously. Our end-to-end system achieves state-of-the-art performance on two large-scale document layout analysis datasets (PubLayNet and DocLayNet), a high-quality hierarchical document structure reconstruction dataset (HRDoc), and our Comp-HRDoc benchmark. The Comp-HRDoc benchmark will be released to facilitate further research in this field.
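To make the three subtasks concrete, the following is a minimal illustrative sketch of how their outputs could be combined into a document tree. It is not the model proposed in this paper: the names `PageObject`, `build_tree`, `outline`, and the `parent_of` mapping are hypothetical, and the detection, ordering, and parent predictions are assumed to be given rather than learned.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PageObject:
    """A detected page object (hypothetical output of the Detect step)."""
    obj_id: int
    category: str  # e.g. "title", "section", "paragraph"
    text: str
    children: List["PageObject"] = field(default_factory=list)


def build_tree(
    objects: List[PageObject],
    reading_order: List[int],            # assumed output of the Order step
    parent_of: Dict[int, Optional[int]]  # assumed output of the Construct step
) -> PageObject:
    """Attach each object to its predicted parent, visiting objects in reading order."""
    by_id = {obj.obj_id: obj for obj in objects}
    root = PageObject(obj_id=-1, category="root", text="")
    for obj_id in reading_order:
        parent_id = parent_of.get(obj_id)
        # Objects without a predicted parent hang directly under the root.
        parent = root if parent_id is None else by_id[parent_id]
        parent.children.append(by_id[obj_id])
    return root


def outline(node: PageObject, depth: int = 0) -> List[str]:
    """Depth-first traversal that renders the tree as an indented outline."""
    lines: List[str] = []
    for child in node.children:
        lines.append("  " * depth + f"{child.category}: {child.text}")
        lines.extend(outline(child, depth + 1))
    return lines


# Objects listed in arbitrary detection order; reading_order restores the sequence.
objects = [
    PageObject(3, "section", "2 Method"),
    PageObject(0, "title", "HDSA"),
    PageObject(2, "paragraph", "Document structure analysis is crucial."),
    PageObject(1, "section", "1 Introduction"),
]
reading_order = [0, 1, 2, 3]
parent_of = {0: None, 1: None, 2: 1, 3: None}

tree = build_tree(objects, reading_order, parent_of)
result = outline(tree)
print("\n".join(result))
```

In this toy example the paragraph is nested under the first section, while the title and both sections attach to the root. Traversing the tree depth-first recovers the reading order, which is why a single tree can represent both the Order and Construct outputs.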