Information in industry, research, and the public sector is widely stored as rendered documents (e.g., PDF files, scans). Enabling downstream tasks therefore requires systems that map rendered documents onto a structured hierarchical format. However, existing systems for this task rely on heuristics and are not end-to-end trainable. In this work, we introduce the Document Structure Generator (DSG), a novel system for document parsing that is fully end-to-end trainable. DSG uses a deep neural network to parse (i) entities in documents (e.g., figures, text blocks, headers) and (ii) relations that capture the sequence and nested structure between entities. Because DSG is trained end-to-end rather than built on heuristics, it is effective and flexible for real-world applications. We further contribute a new, large-scale dataset called E-Periodica, comprising real-world magazines with complex document structures, for evaluation. Our results demonstrate that DSG outperforms commercial OCR tools and achieves state-of-the-art performance. To the best of our knowledge, DSG is the first end-to-end trainable system for hierarchical document parsing.