Recent advances in Deep Learning-based Handwritten Text Recognition (HTR) have produced models that perform remarkably well on both modern and historical manuscripts in large benchmark datasets. Nonetheless, these models struggle to match that performance when applied to manuscripts with peculiar characteristics, such as language, paper support, ink, and author handwriting. This issue is particularly relevant for valuable but small collections of documents preserved in historical archives, for which obtaining sufficient annotated training data is costly or, in some cases, infeasible. A possible solution is to pretrain HTR models on large datasets and then fine-tune them on small single-author collections. In this paper, we consider both large, real benchmark datasets and synthetic ones obtained with a styled Handwritten Text Generation model. Through extensive experimental analysis, which also considers the number of fine-tuning lines, we give a quantitative indication of which characteristics of such data are most relevant for obtaining an HTR model able to effectively transcribe manuscripts in small collections with as few as five real fine-tuning lines.