Relevant information in documents is often summarized in tables, helping the reader to identify useful facts. Most benchmark datasets support either document layout analysis or table understanding, but do not provide the data needed to address both tasks in a unified way. We define the task of Contextualized Table Extraction (CTE), which aims to extract tables and define their structure while taking the textual context of the document into account. Our dataset comprises 75k fully annotated pages of scientific papers, including more than 35k tables. Data are gathered from PubMed Central, merging the information provided by annotations in the PubTables-1M and PubLayNet datasets. The dataset supports CTE and adds new classes to the original ones. The generated annotations can be used to develop end-to-end pipelines for various tasks, including document layout analysis, table detection, structure recognition, and functional analysis. We formally define CTE and its evaluation metrics, show which subtasks can be tackled, and describe the advantages, limitations, and future work of this data collection. Annotations and code will be accessible at https://github.com/AILab-UniFI/cte-dataset.