Text region detection in historical astronomical diagrams

Text detection is a crucial task in the analysis of historical documents. While datasets and benchmarks exist for text detection in manuscripts and maps, text in mathematical diagrams has received little attention. To address this, we introduce a large-scale, diverse, open-access dataset of 948 historical astronomical diagrams containing 10,940 oriented polygonal text regions. Our dataset spans ten centuries (8th to 18th) and seven main linguistic traditions, with the following diagram counts: Arabic and Persian (115), Chinese (332), Byzantine (233), Latin (185), Hebrew (48), and Sanskrit (35). It captures a wide range of diagram styles and textual content, from symbols to multi-line paragraphs. Each text instance is annotated with an ordered polygon that precisely delineates the text region and encodes the reading direction. In addition, we annotated the 2,293 regions in Latin diagrams with 20 class labels. We evaluated several strong baselines on our dataset, including TESTR, DeepSolo++, and Poly-DETR, a simple extension of DINO-DETR that we designed to predict ordered polygon vertices. Poly-DETR achieves state-of-the-art performance on the MTHv2 and cBAD2019 benchmarks and provides a solid, simple baseline on our dataset. Code and dataset are available online.
