Scene Text Recognition (STR) is a challenging task due to variations in text style, shape, and background. Incorporating linguistic information is an effective way to enhance the robustness of STR models. Existing methods rely on permuted language modeling (PLM) or masked language modeling (MLM) to learn contextual information implicitly, either by training an ensemble of permuted autoregressive (AR) language models or through an iterative non-autoregressive (NAR) decoding procedure. However, both approaches have limitations: PLM's AR decoding has no access to information about future characters, while MLM provides global information about the entire text but neglects the dependencies among the predicted characters. In this paper, we propose a Masked and Permuted Implicit Context Learning Network for STR, which unifies PLM and MLM within a single decoding architecture and inherits the advantages of both. We adopt the training procedure of PLM and, to integrate MLM, incorporate word length information into the decoding process by introducing a specific number of mask tokens. Experimental results demonstrate that the proposed model achieves state-of-the-art performance on standard benchmarks with both AR and NAR decoding.
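To make the two ingredients concrete, the following is a minimal sketch, not the implementation used in this paper. It assumes that a permutation is expressed as an attention mask over character positions (as is common in PLM-style training) and that word length is conveyed by one mask token per character; how the proposed network actually combines the two is not specified in this abstract. The names `MASK`, `permutation_attention_mask`, and `decoder_context` are hypothetical.

```python
MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def permutation_attention_mask(perm):
    """Return allowed[i][j]: True if position i may attend to position j.

    Position i sees position j only if j comes earlier than i in the
    factorization order `perm` (a permutation of range(n)).
    """
    n = len(perm)
    rank = {pos: step for step, pos in enumerate(perm)}
    return [[rank[j] < rank[i] for j in range(n)] for i in range(n)]


def decoder_context(target, perm, step):
    """Context visible when predicting the character at position perm[step].

    Assumption: the context always has len(target) slots, so the word
    length is visible from the start. Characters already decoded in
    permutation order are revealed; all other slots stay masked.
    """
    context = [MASK] * len(target)
    for pos in perm[:step]:
        context[pos] = target[pos]
    return context


target = "text"
perm = [2, 0, 3, 1]  # one sampled factorization order

mask = permutation_attention_mask(perm)
# Position 2 is decoded first, so it attends to no other character.
# Position 1 is decoded last, so it attends to positions 0, 2 and 3.

context_at_start = decoder_context(target, perm, step=0)
context_at_step_2 = decoder_context(target, perm, step=2)
print(context_at_start)   # ['[MASK]', '[MASK]', '[MASK]', '[MASK]']
print(context_at_step_2)  # ['t', '[MASK]', 'x', '[MASK]']
```

With the identity permutation `[0, 1, 2, 3]`, the mask reduces to the usual strictly lower-triangular left-to-right AR mask, while the fully masked context at step 0 corresponds to the starting point of NAR-style decoding with a known length.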