Recognizing text lines from images is a challenging problem, especially for handwritten documents, because writing styles vary widely. Although text line recognition models are generally trained on large corpora of real and synthetic data, they can still make frequent mistakes when the handwriting is hard to read or when image acquisition introduces corruptions such as noise, blur, or compression artifacts. An individual's writing style is generally quite consistent, and this consistency can be leveraged to correct such mistakes. Motivated by this, we introduce the problem of adapting text line recognition models at test time. We focus on a challenging and realistic setting: given only a single test image consisting of multiple text lines, the task is to adapt the model, without any labels, so that it performs better on that image. We propose an iterative self-training approach that uses feedback from the language model to update the optical model with confident self-labels in each iteration. The confidence measure is based on an augmentation mechanism that evaluates the divergence of the model's predictions in a local region. We rigorously evaluate our method on several benchmark datasets and on corrupted versions of them. Experimental results on multiple datasets spanning multiple scripts show that the proposed adaptation method yields an absolute improvement of up to 8% in character error rate with just a few iterations of self-training at test time.
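The self-training loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the divergence measure, the augmentations, or the update rule, so the sketch assumes a normalized edit distance between the prediction on a line and the predictions on augmented copies of it, and it treats `recognize` (optical model plus language-model decoding), `update` (one fine-tuning step on the optical model), `augmentations`, and `max_divergence` as hypothetical placeholders supplied by the caller.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Image = TypeVar("Image")  # stand-in for whatever a text-line image is


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]


def divergence(pred: str, augmented_preds: Sequence[str]) -> float:
    """Mean normalized edit distance between a prediction and the
    predictions on augmented copies (assumed measure, not from the paper)."""
    if not augmented_preds:
        return 0.0
    total = 0.0
    for other in augmented_preds:
        total += edit_distance(pred, other) / max(len(pred), len(other), 1)
    return total / len(augmented_preds)


def select_confident(
    lines: Sequence[Image],
    recognize: Callable[[Image], str],
    augmentations: Sequence[Callable[[Image], Image]],
    max_divergence: float = 0.1,  # hypothetical threshold
) -> List[Tuple[Image, str]]:
    """Keep lines whose prediction is stable under augmentation."""
    selected = []
    for line in lines:
        pred = recognize(line)
        aug_preds = [recognize(aug(line)) for aug in augmentations]
        if divergence(pred, aug_preds) <= max_divergence:
            selected.append((line, pred))
    return selected


def adapt(
    lines: Sequence[Image],
    recognize: Callable[[Image], str],
    update: Callable[[List[Tuple[Image, str]]], None],
    augmentations: Sequence[Callable[[Image], Image]],
    iterations: int = 3,
    max_divergence: float = 0.1,
) -> List[str]:
    """Iterative test-time self-training on the lines of one image."""
    for _ in range(iterations):
        pseudo_labeled = select_confident(
            lines, recognize, augmentations, max_divergence)
        if not pseudo_labeled:
            break  # nothing confident enough to train on
        update(pseudo_labeled)  # hypothetical optical-model update
    return [recognize(line) for line in lines]
```

In this sketch a line contributes a self-label only when its prediction barely changes under augmentation, which is one plausible reading of "divergence of the prediction in a local region"; other divergence measures over the model's output distributions would fit the same loop.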