This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features: (i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin; (ii) Line Item Recognition represents a highly practical information extraction task, where key information has to be assigned to items in a table; (iii) documents come from numerous layouts and the test set includes zero- and few-shot cases as well as layouts commonly seen in the training set. The benchmark comes with several baselines, including RoBERTa, LayoutLMv3 and DETR-based Table Transformer. These baseline models were applied to both tasks of the DocILE benchmark, with results shared in this paper, offering a quick starting point for future work. The dataset and baselines are available at https://github.com/rossumai/docile.