This paper introduces the DocILE benchmark, which provides the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features: (i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin; (ii) Line Item Recognition, a highly practical information extraction task in which key information has to be assigned to items in a table; (iii) documents from numerous layouts, with a test set that includes zero- and few-shot cases as well as layouts commonly seen in the training set. The benchmark comes with several baselines, including RoBERTa, LayoutLMv3 and a DETR-based Table Transformer, applied to both tasks of the DocILE benchmark. Their results are reported in this paper and offer a quick starting point for future work. The dataset, baselines and supplementary material are available at https://github.com/rossumai/docile.
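To make the Line Item Recognition task concrete, the following minimal Python sketch groups extracted fields by the table row (line item) they belong to. It is an illustration only: the `Field` class, the `group_line_items` helper and the field type names are hypothetical and are not taken from the DocILE code base or its annotation schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Field:
    """One extracted piece of key information (hypothetical structure)."""
    fieldtype: str                      # illustrative names, e.g. "description"
    text: str
    line_item_id: Optional[int] = None  # None marks a document-level field


def group_line_items(fields):
    """Assign each line-item field to its row, keyed by line_item_id."""
    items = defaultdict(dict)
    for field in fields:
        if field.line_item_id is not None:
            items[field.line_item_id][field.fieldtype] = field.text
    # Return rows in ascending order of their line item id.
    return dict(sorted(items.items()))


fields = [
    Field("document_id", "INV-001"),          # document-level, not in a row
    Field("description", "Paper A4", 1),
    Field("quantity", "10", 1),
    Field("description", "Toner", 2),
    Field("quantity", "2", 2),
]
line_items = group_line_items(fields)
```

Under these assumptions, document-level fields (those without a line item id) are left out of the result, and each remaining field ends up in exactly one row.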