Every day, thousands of digital documents are generated that contain useful information for companies, public organizations, and citizens. Because processing them manually is infeasible, automatic document processing is becoming increasingly necessary in certain sectors. However, this task remains challenging, since in most cases text-only parsing is not enough to fully understand information presented through different components of varying significance. In this regard, Document Layout Analysis (DLA), which aims to detect and classify the basic components of a document, has been an active research field for many years. In this work, we use a procedure to semi-automatically annotate digital documents with different layout labels, including 4 basic layout blocks and 4 text categories. We apply this procedure to collect a novel database for DLA in the public affairs domain, using a set of 24 data sources from the Spanish Administration. The database comprises 37.9K documents with more than 441K document pages, and more than 8M labels associated with 8 layout block units. The results of our experiments validate the proposed text labeling procedure, with accuracy up to 99%.