Transformer-based language models are widely used in Natural Language Processing tasks. Thanks to their pre-training, they have been successfully adapted to Information Extraction from business documents. However, most pre-training tasks proposed in the literature for business documents are too generic to capture their more complex structures. In this paper, we build on LayoutLM, a language model pre-trained on a collection of business documents, and introduce two new pre-training tasks that further improve its capacity to extract relevant information. The first aims at a better understanding of complex document layouts, and the second focuses on numeric values and their order of magnitude. These tasks force the model to learn better-contextualized representations of the scanned documents. We further introduce a new post-processing algorithm for decoding BIESO tags in Information Extraction that performs better on complex entities. Our method significantly improves extraction performance on both public (from 93.88 to 95.50 F1 score) and private (from 84.35 to 84.84 F1 score) datasets composed of expense receipts, invoices, and purchase orders.
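For readers unfamiliar with the BIESO scheme (Begin, Inside, End, Single, Outside), the following is a minimal sketch of a strict baseline decoder that turns a sequence of predicted tags into entity spans. It is not the post-processing algorithm introduced in this paper, whose details are not given in the abstract. The function name `decode_bieso`, the `PREFIX-LABEL` tag format, and the example labels are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

# (label, start index, end index), with both indices inclusive.
Span = Tuple[str, int, int]


def decode_bieso(tags: List[str]) -> List[Span]:
    """Strict baseline decoding of BIESO tags into entity spans.

    Assumes tags look like "B-TOTAL", "I-TOTAL", "E-TOTAL", "S-DATE" or "O".
    Any sequence that is not a well-formed B (I)* E run or a lone S is dropped.
    """
    spans: List[Span] = []
    start: Optional[int] = None   # index of the currently open B tag, if any
    label: Optional[str] = None   # label of the currently open entity

    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        if prefix == "S" and name:
            # Single-token entity; any entity left open is discarded.
            spans.append((name, i, i))
            start, label = None, None
        elif prefix == "B" and name:
            # Open a new entity (an unfinished previous one is discarded).
            start, label = i, name
        elif prefix == "I" and name and start is not None and name == label:
            # Valid continuation of the open entity.
            continue
        elif prefix == "E" and name and start is not None and name == label:
            # Valid end of the open entity.
            spans.append((name, start, i))
            start, label = None, None
        else:
            # "O", a label mismatch, or an I/E tag without an opening B.
            start, label = None, None
    return spans


example = ["O", "B-TOTAL", "I-TOTAL", "E-TOTAL", "S-DATE",
           "B-VENDOR", "O", "I-VENDOR", "E-VENDOR"]
print(decode_bieso(example))
```

In this example the `VENDOR` entity is interrupted by an `O` tag, so the strict decoder discards it entirely. Long or multi-line entities that are lost to a single mispredicted tag in this way are one plausible reason a more tolerant decoding step could help with complex entities.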