Document analysis and understanding models often require large amounts of annotated training data. Many document-related tasks go beyond text transcription and need both the textual content and precise bounding-box annotations that identify the different document elements. Collecting such data is challenging, particularly for invoices, where privacy concerns add a further layer of complexity. In this paper, we introduce FATURA, a resource for researchers in document analysis and understanding. FATURA is a highly diverse dataset of annotated, multi-layout invoice document images. Comprising $10,000$ invoices with $50$ distinct layouts, it is the largest openly accessible image dataset of invoice documents known to date. We also provide benchmarks for several document analysis and understanding tasks and conduct experiments under a range of training and evaluation scenarios. The dataset is freely accessible at https://zenodo.org/record/8261508 to support further research in document analysis and understanding.