We introduce the AnnoPage Dataset, a novel collection of 7550 pages from historical documents, primarily in Czech and German, spanning from 1485 to the present, with a focus on the late 19th and early 20th centuries. The dataset is designed to support research in document layout analysis and object detection. Each page is annotated with axis-aligned bounding boxes (AABB) marking non-textual elements from 25 categories, such as images, maps, decorative elements, or charts, following the Czech Methodology of image document processing. The annotations were created by expert librarians to ensure accuracy and consistency. The dataset also incorporates pages from multiple, mainly historical, document datasets to enhance variability and maintain continuity. It is divided into development and test subsets, with the test set carefully selected to maintain the category distribution. We provide baseline results using YOLO and DETR object detectors, offering a reference point for future research. The AnnoPage Dataset is publicly available on Zenodo (https://doi.org/10.5281/zenodo.12788419), along with ground-truth annotations in YOLO format.
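Since the ground-truth annotations are distributed in YOLO format, a minimal sketch of reading such labels may be useful. It assumes the common YOLO text convention, in which each line holds `class_id x_center y_center width height` with coordinates normalized to the page size. The `Box` type, the `parse_yolo_labels` function, and the page dimensions in the usage example are hypothetical names and values chosen for illustration; the dataset's actual file layout and the mapping from class ids to the 25 categories should be taken from the Zenodo record.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical helper type)."""
    class_id: int
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def parse_yolo_labels(text: str, page_width: int, page_height: int) -> list[Box]:
    """Convert YOLO label lines into pixel-space axis-aligned boxes.

    Assumes each line is: class_id x_center y_center width height,
    with all four coordinates normalized to the range [0, 1].
    """
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # skip blank or malformed lines
        class_id = int(parts[0])
        xc, yc, w, h = (float(v) for v in parts[1:])
        boxes.append(
            Box(
                class_id,
                (xc - w / 2) * page_width,   # left edge
                (yc - h / 2) * page_height,  # top edge
                (xc + w / 2) * page_width,   # right edge
                (yc + h / 2) * page_height,  # bottom edge
            )
        )
    return boxes


# Usage with a made-up label line and a made-up 1000 x 2000 pixel page.
example = parse_yolo_labels("3 0.5 0.5 0.5 0.25\n", 1000, 2000)
print(example)
```

Storing boxes as corner coordinates rather than center and size is only one possible choice; it tends to be convenient for cropping or drawing the annotated regions.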