Recent advances in Automatic Text Recognition (ATR) have improved access to historical archives, yet a methodological divide persists between palaeographic transcriptions and normalized digital editions. While ATR models trained on palaeographically oriented datasets such as CATMuS have shown greater generalizability, their raw outputs remain poorly suited to most readers and to downstream NLP tools, creating a usability gap. Conversely, ATR models trained to produce normalized outputs struggle to adapt to new domains and tend to over-normalize and hallucinate. We introduce the task of Pre-Editorial Normalization (PEN), which consists of normalizing graphemic ATR output according to editorial conventions, thereby retaining an intermediate step with palaeographic fidelity while providing a normalized version for practical use. We present a new dataset derived from the CoMMA corpus and aligned with digitized Old French and Latin editions using passim. We also produce a manually corrected gold-standard evaluation set. We benchmark this resource using ByT5-based sequence-to-sequence models on normalization and pre-annotation tasks. Our contributions include the formal definition of PEN, a 4.66M-sample silver training corpus, a 1.8k-sample gold evaluation set, and a normalization model achieving a 6.7% CER, substantially outperforming previous models on this task.