The field of visually rich document understanding (VRDU) aims to solve a multitude of well-researched NLP tasks in a multi-modal domain. Several datasets exist for research on specific VRDU tasks such as document classification (DC), key entity extraction (KEE), entity linking, and visual question answering (VQA). These datasets cover documents like invoices and receipts with sparse annotations, such that they support only one or two related tasks (e.g., entity extraction and entity linking). Unfortunately, focusing on a single type of document or task is not representative of how documents often need to be processed in the wild, where variety in style and requirements is expected. In this paper, we introduce BuDDIE (Business Document Dataset for Information Extraction), the first multi-task dataset of 1,665 real-world business documents that contains rich and dense annotations for DC, KEE, and VQA. Our dataset consists of publicly available business entity documents from US state government websites. The documents are structured and vary in their style and layout across states and types (e.g., forms, certificates, and reports). We provide data variety and quality metrics for BuDDIE as well as a series of baselines for each task. Our baselines cover traditional textual, multi-modal, and large language model approaches to VRDU.