In recent years, extracting information from business documents has emerged as a critical task with applications across numerous domains. It has attracted substantial interest from both industry and academia, underscoring its significance in the current technological landscape. Most datasets in this area focus on Key Information Extraction (KIE), in which information is extracted using a specific, predefined set of keys. Unlike most existing datasets and benchmarks, our focus is on discovering key-value pairs (KVPs) without relying on predefined keys, across diverse templates and complex layouts. This task presents unique challenges, primarily because comprehensive datasets and benchmarks tailored to KVP extraction without predefined keys are lacking. To address this gap, we introduce KVP10k, a new dataset and benchmark specifically designed for KVP extraction. The dataset contains 10707 richly annotated images. In our benchmark, we also introduce a challenging new task that combines elements of KIE and KVP extraction. KVP10k sets itself apart with its extensive data diversity and richly detailed annotations, paving the way for advances in information extraction from complex business documents.