Document Layout Analysis, the task of identifying the different semantic regions within a document page, is of great interest to both computer scientists and humanities scholars: for the former it is a fundamental step towards further analysis tasks, and for the latter it is a powerful tool that improves and facilitates the study of documents. However, many works in the current literature, especially the available datasets, fail to meet the needs of both communities. In particular, they tend to lean towards the needs and common practices of computer science, yielding resources that are not representative of the real needs of the humanities. For this reason, the present paper introduces U-DIADS-Bib, a novel, pixel-precise, non-overlapping and noiseless document layout analysis dataset developed in close collaboration between specialists in computer vision and the humanities. Furthermore, we propose a novel computer-aided segmentation pipeline to alleviate the burden of the time-consuming manual annotation required to generate the ground truth segmentation maps. Finally, we present a standardized few-shot version of the dataset (U-DIADS-BibFS), with the aim of encouraging the development of models and solutions that can address this task with as few samples as possible, which would allow for more effective use in real-world scenarios, where collecting a large number of segmentations is not always feasible.