Document layout understanding is a field of study that analyzes the spatial arrangement of information in a document in order to understand its structure and layout. Models such as LayoutLM (and its subsequent iterations) achieve state-of-the-art (SotA) results on semi-structured documents; however, the scarcity of open semi-structured data remains a limitation. Although semi-structured documents are common in everyday life (balance sheets, purchase orders, receipts), few public datasets exist for training machine learning models on this type of document. In this investigation we propose a method for generating new, synthetic layout information that can help overcome this data shortage. According to our results, the proposed method performs better than LayoutTransformer, another popular layout generation method. We also show that, in some scenarios, text classification can improve when supported by bounding box information.