Diachronic Document Dataset for Semantic Layout Analysis

We present a novel, open-access dataset for semantic layout analysis, built to support document recreation workflows through a mapping to the Text Encoding Initiative (TEI) standard. The dataset comprises 7,254 annotated pages of digitised and born-digital materials spanning a wide temporal range (1600-2024) and diverse document types (magazines, papers from the sciences and humanities, PhD theses, monographs, plays, administrative reports, etc.), sorted into modular subsets. By incorporating content from different periods and genres, it covers varying layout complexities and historical changes in document structure. The modular design allows domain-specific configurations. We evaluate object detection models on this dataset, examining the impact of input size and of subset-based training. Results show that a 1280-pixel input size is optimal for YOLO, and that subset-based training generally benefits more from incorporating the subsets into a generic model than from fine-tuning pre-trained weights.
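As a rough illustration of what a domain-specific configuration over modular subsets might look like, the following Python sketch selects subsets by name or period and merges their pages into one training list. The subset names, year ranges, file paths, and helper functions are all hypothetical and are not taken from the dataset itself; only the 1280-pixel input size comes from the abstract.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Subset:
    """One modular subset: a name, the period it covers, and its page images."""
    name: str
    start_year: int
    end_year: int
    pages: Tuple[str, ...]


def select_subsets(
    subsets: Sequence[Subset],
    names: Optional[Sequence[str]] = None,
    period: Optional[Tuple[int, int]] = None,
) -> List[Subset]:
    """Keep subsets whose name is in `names` (if given) and whose
    year range overlaps `period` (if given)."""
    chosen = []
    for subset in subsets:
        if names is not None and subset.name not in names:
            continue
        if period is not None:
            first, last = period
            if subset.end_year < first or subset.start_year > last:
                continue
        chosen.append(subset)
    return chosen


def build_training_list(subsets: Sequence[Subset]) -> List[str]:
    """Merge the pages of the selected subsets into one sorted, de-duplicated list."""
    pages = set()
    for subset in subsets:
        pages.update(subset.pages)
    return sorted(pages)


# Hypothetical catalogue; the real subset names and contents may differ.
CATALOGUE = [
    Subset("plays", 1600, 1800, ("plays/p001.jpg", "plays/p002.jpg")),
    Subset("theses", 1950, 2024, ("theses/t001.jpg",)),
    Subset("magazines", 1900, 2000, ("magazines/m001.jpg", "magazines/m002.jpg")),
]

# Input size reported as optimal for YOLO in the abstract.
IMAGE_SIZE = 1280

modern = select_subsets(CATALOGUE, period=(1900, 2024))
training_pages = build_training_list(modern)
```

Here `training_pages` holds the pages of every subset overlapping 1900-2024; passing the full catalogue instead would correspond to the generic-model setting the abstract compares against.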

