Text detection is a crucial task in the analysis of historical documents. While datasets and benchmarks exist for text detection in manuscripts and maps, text in mathematical diagrams has received little attention. To address this, we introduce a large-scale, diverse, open-access dataset of 948 historical astronomical diagrams containing 10,940 oriented polygonal text regions. Our dataset spans ten centuries (8th to 18th) and seven main linguistic traditions, with the following number of diagrams per group: Arabic and Persian (115), Chinese (332), Byzantine (233), Latin (185), Hebrew (48), and Sanskrit (35). It captures a wide range of diagram styles and textual content, from single symbols to multi-line paragraphs. Each text instance is annotated with an ordered polygon that precisely delineates the text region and encodes the reading direction. In addition, we annotated the 2,293 regions in the Latin diagrams with 20 class labels. We evaluated several strong baselines on our dataset, including TESTR, DeepSolo++, and Poly-DETR, a simple extension of DINO-DETR that we designed to predict ordered polygon vertices. Poly-DETR achieves state-of-the-art performance on the MTHv2 and cBAD2019 benchmarks and provides a solid, simple baseline on our dataset. The code and dataset are available online.
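To illustrate how an ordered polygon can carry a reading direction, here is a minimal Python sketch. The vertex convention is an assumption made for illustration only and is not taken from the dataset specification: the first vertex is assumed to sit at the start of the text line, and the first edge (vertex 0 to vertex 1) is assumed to run along the reading direction. The function names `reading_angle` and `signed_area` are hypothetical. The sketch also checks that the per-tradition diagram counts quoted above add up to the stated total of 948.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in image coordinates, y pointing down


def reading_angle(polygon: List[Point]) -> float:
    """Angle in degrees of the first edge (vertex 0 -> vertex 1).

    ASSUMPTION (hypothetical convention): the first edge of the ordered
    polygon runs along the reading direction of the text.
    """
    (x0, y0), (x1, y1) = polygon[0], polygon[1]
    return math.degrees(math.atan2(y1 - y0, x1 - x0)) % 360.0


def signed_area(polygon: List[Point]) -> float:
    """Shoelace formula; the sign tells the vertex winding order."""
    total = 0.0
    n = len(polygon)
    for i in range(n):
        x_a, y_a = polygon[i]
        x_b, y_b = polygon[(i + 1) % n]
        total += x_a * y_b - x_b * y_a
    return total / 2.0


# Same rectangle, two different starting vertices: the region is identical,
# but the vertex order yields a different assumed reading direction.
horizontal = [(0, 0), (10, 0), (10, 2), (0, 2)]
top_to_bottom = [(10, 0), (10, 2), (0, 2), (0, 0)]

# Diagram counts per tradition, as quoted in the abstract.
counts = {
    "Arabic and Persian": 115,
    "Chinese": 332,
    "Byzantine": 233,
    "Latin": 185,
    "Hebrew": 48,
    "Sanskrit": 35,
}
total_diagrams = sum(counts.values())

if __name__ == "__main__":
    print(reading_angle(horizontal), reading_angle(top_to_bottom))
    print(signed_area(horizontal), total_diagrams)
```

Under this assumed convention, rotating the starting vertex changes the decoded direction without changing the region itself, which is why an unordered polygon or an axis-aligned box would not be enough to recover how the text should be read.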