A Text Recognition Dataset from Sahidic Coptic Ancient Manuscripts

In this work, we target Handwritten Text Recognition (HTR) in low-resource scenarios, which arise from underrepresented languages, rare scripts, and degraded visual conditions typical of historical documents. We introduce SCAM (Sahidic Coptic Ancient Manuscripts), a new line-level dataset built from digitized ancient manuscripts written in the extinct Sahidic Coptic dialect. The dataset reflects a realistic and challenging setting, as it combines heterogeneous acquisition conditions across libraries with typical manuscript degradations such as ink fading, bleed-through, and material deterioration. In addition to visual complexity, SCAM poses significant linguistic challenges due to the scarcity of resources for Sahidic Coptic, its uncommon alphabet, and dialect-specific diacritics. To support research in low-resource HTR, we benchmark several state-of-the-art approaches based on different paradigms, highlighting their limitations and strengths in this setting. Our results underline the gap between current HTR performance on well-resourced modern scripts and historically grounded, low-resource scenarios, thus providing a reference point for future developments.

