AnnoPage Dataset: Dataset of Non-Textual Elements in Documents with Fine-Grained Categorization

We introduce the AnnoPage Dataset, a novel collection of 7550 pages from historical documents, primarily in Czech and German, dating from 1485 to the present, with an emphasis on the late 19th and early 20th centuries. The dataset is designed to support research in document layout analysis and object detection. Each page is annotated with axis-aligned bounding boxes (AABB) marking non-textual elements from 25 categories, such as images, maps, decorative elements, or charts, following the Czech Methodology of image document processing. The annotations were created by expert librarians to ensure accuracy and consistency. The dataset also incorporates pages from multiple, mainly historical, document datasets to enhance variability and maintain continuity. The dataset is divided into development and test subsets, with the test set selected so that it preserves the category distribution. We provide baseline results using YOLO and DETR object detectors as a reference point for future research. The AnnoPage Dataset is publicly available on Zenodo (https://doi.org/10.5281/zenodo.12788419), along with ground-truth annotations in YOLO format.
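As a rough illustration of what working with the ground-truth annotations might look like, the sketch below converts one YOLO-format label line into a pixel-space axis-aligned bounding box. It assumes the common YOLO text convention (`class x_center y_center width height`, with coordinates normalized to the page size); the abstract does not describe the exact file layout, so the `Box` type, the function name, and the sample values are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Pixel-space axis-aligned bounding box (hypothetical helper type)."""
    category_id: int
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def yolo_line_to_aabb(line: str, page_width: int, page_height: int) -> Box:
    """Convert one YOLO label line to pixel corner coordinates.

    Assumes the usual YOLO convention: five whitespace-separated fields,
    'class x_center y_center width height', all but the class normalized
    to [0, 1] relative to the page dimensions.
    """
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    category_id = int(fields[0])
    cx, cy, w, h = (float(value) for value in fields[1:])
    return Box(
        category_id,
        (cx - w / 2) * page_width,
        (cy - h / 2) * page_height,
        (cx + w / 2) * page_width,
        (cy + h / 2) * page_height,
    )


# Made-up sample line: category 3, centered on a 1000 x 2000 pixel page.
box = yolo_line_to_aabb("3 0.5 0.5 0.5 0.25", page_width=1000, page_height=2000)
print(box)
```

Corner coordinates are often more convenient than the center/size form when computing overlap between predicted and ground-truth boxes, which is why the sketch returns them.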

