Handwriting recognition is a key technology for accessing the content of old manuscripts and thus helps to preserve cultural heritage. Deep learning performs impressively on this task. However, to reach its full potential, it requires a large amount of labeled data, which is difficult to obtain for ancient languages and scripts. Often, a trade-off has to be made between ground truth quantity and quality, as is the case for the recently introduced Bullinger database. It contains more than a hundred thousand labeled text line images of mostly premodern German and Latin texts, obtained by automatically aligning existing page-level transcriptions with text line images. However, the alignment process introduces systematic errors, such as wrongly hyphenated words. In this paper, we investigate the impact of such errors on training and evaluation and suggest means to detect and correct typical alignment errors.