Scene text recognition (STR) is difficult because of variation in text styles, shapes, and backgrounds. Although integrating linguistic information improves model performance, existing methods based on either permuted language modeling (PLM) or masked language modeling (MLM) have pitfalls: PLM's autoregressive decoding lacks foresight into subsequent characters, while MLM overlooks inter-character dependencies. To address these problems, we propose a masked and permuted implicit context learning network for STR, which unifies PLM and MLM within a single decoder and inherits the advantages of both approaches. We adopt the training procedure of PLM; to integrate MLM, we incorporate word length information into the decoding process and replace the undetermined characters with mask tokens. In addition, perturbation training is employed to make the model more robust to potential length prediction errors. In our empirical evaluations, the model not only achieves superior performance on the common benchmarks but also achieves a substantial improvement of $9.1\%$ on the more challenging Union14M-Benchmark.
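The decoding idea described above (a known word length, decoded characters kept in place, undetermined characters replaced with mask tokens, and perturbed lengths during training) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `masked_context` and `perturb_length`, the `[MASK]` token string, and the shift-by-one perturbation with probability `prob` are all assumptions made for illustration, since the abstract does not specify these details.

```python
import random

MASK = "[MASK]"  # hypothetical mask token; the real vocabulary is not given


def masked_context(target, order, step):
    """Build the decoder context before predicting position order[step].

    Positions already decoded (order[:step]) keep their characters. Every
    other position, up to the known word length, holds a mask token, so the
    decoder sees how many characters remain and where they sit.
    """
    context = [MASK] * len(target)
    for pos in order[:step]:
        context[pos] = target[pos]
    return context


def perturb_length(length, max_shift=1, prob=0.5, rng=random):
    """Assumed perturbation scheme: with probability `prob`, shift the length
    by a nonzero amount in [-max_shift, max_shift], never going below 1."""
    if rng.random() < prob:
        shifts = [d for d in range(-max_shift, max_shift + 1) if d != 0]
        length += rng.choice(shifts)
    return max(1, length)


if __name__ == "__main__":
    # Decode "text" in the permuted order 2, 0, 3, 1 and show each context.
    word, order = "text", [2, 0, 3, 1]
    for step in range(len(word)):
        print(order[step], masked_context(word, order, step))
    print(perturb_length(len(word), rng=random.Random(0)))
```

With the permuted order `[2, 0, 3, 1]`, the context at step 2 holds the characters at positions 2 and 0 and mask tokens elsewhere, which combines the permuted decoding order of PLM with the masked placeholders of MLM. Training on lengths drawn from `perturb_length` rather than the true length is one plausible way to expose the decoder to length prediction errors.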