Handwriting recognition is a challenging and critical problem in the fields of pattern recognition and machine learning, with applications spanning a wide range of domains. In this paper, we focus on the specific issue of recognizing offline Arabic handwritten text. Existing approaches typically utilize a combination of convolutional neural networks for image feature extraction and recurrent neural networks for temporal modeling, with connectionist temporal classification used for text generation. However, these methods suffer from a lack of parallelization due to the sequential nature of recurrent neural networks. Furthermore, these models cannot account for linguistic rules, necessitating the use of an external language model in the post-processing stage to boost accuracy. To overcome these issues, we introduce two alternative architectures, namely the Transformer Transducer and the standard sequence-to-sequence Transformer, and compare their performance in terms of accuracy and speed. Our approach can model language dependencies and relies only on the attention mechanism, thereby making it more parallelizable and less complex. We employ pre-trained Transformers for both image understanding and language modeling. Our evaluation on the Arabic KHATT dataset demonstrates that our proposed method outperforms the current state-of-the-art approaches for recognizing offline Arabic handwritten text.