The Transformer has quickly become the dominant architecture for a wide range of pattern recognition tasks due to its capacity for long-range representation. However, transformers are data-hungry models that require large training datasets. In Handwritten Text Recognition (HTR), collecting a massive amount of labeled data is complicated and expensive. In this paper, we propose a lite transformer architecture for full-page multi-script handwriting recognition. The proposed model offers three advantages. First, to address the common problem of data scarcity, it can be trained on a reasonable amount of data, as is the case for most public HTR datasets, without the need for external data. Second, it learns the reading order at page level thanks to a curriculum learning strategy, which allows it to avoid line segmentation errors, exploit a larger context, and reduce the need for costly segmentation annotations. Third, it can be easily adapted to other scripts through a simple transfer-learning process that uses only page-level labeled images. Extensive experiments on datasets in different languages and scripts (French, English, Spanish, and Arabic) demonstrate the effectiveness of the proposed model.