In this work, we target Handwritten Text Recognition (HTR) in low-resource scenarios, which arise from underrepresented languages, rare scripts, and the degraded visual conditions typical of historical documents. We introduce SCAM (Sahidic Coptic Ancient Manuscripts), a new line-level dataset built from digitized ancient manuscripts written in the extinct Sahidic Coptic dialect. The dataset reflects a realistic and challenging setting, as it combines heterogeneous acquisition conditions across libraries with typical manuscript degradations such as ink fading, bleed-through, and material deterioration. In addition to visual complexity, SCAM poses significant linguistic challenges due to the scarcity of resources for Sahidic Coptic, its uncommon alphabet, and dialect-specific diacritics. To support research in low-resource HTR, we benchmark several state-of-the-art approaches based on different paradigms, highlighting their strengths and limitations in this setting. Our results underline the gap between current HTR performance on well-resourced modern scripts and performance on historical, low-resource material, thus providing a reference point for future developments.