Existing text recognition methods usually require large-scale training data, and most rely on synthetic data because annotated real images are scarce. However, the domain gap between synthetic and real data limits the performance of text recognition models. Recent self-supervised text recognition methods have attempted to exploit unlabeled real images through contrastive learning, which mainly learns to discriminate between text images. Inspired by the observation that humans learn to recognize text through both reading and writing, we propose to learn discrimination and generation jointly by integrating contrastive learning and masked image modeling in a single self-supervised method. The contrastive learning branch learns the discrimination of text images, imitating human reading behavior. Meanwhile, masked image modeling is introduced to text recognition for the first time to learn the context generation of text images, which resembles writing behavior. Experimental results show that our method outperforms previous self-supervised text recognition methods by 10.2%-20.2% on irregular scene text recognition datasets. Moreover, our proposed text recognizer exceeds previous state-of-the-art text recognition methods by 5.3% on average across 11 benchmarks, with a similar model size. We also demonstrate that our pre-trained model can be easily applied to other text-related tasks with clear performance gains. The code is available at https://github.com/ayumiymk/DiG.
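The idea of training one encoder with both a discriminative and a generative objective can be illustrated with a small sketch. This is not the authors' implementation (see the repository linked above for that); it is a minimal, dependency-free illustration that assumes the two objectives are combined as a weighted sum. The function names, the weight `alpha`, the temperature, and the mask ratio are all hypothetical choices made for this example.

```python
import math
import random


def info_nce(queries, keys, temperature=0.2):
    """Contrastive (InfoNCE) loss: queries[i] should match keys[i]
    and be dissimilar to every other key in the batch."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity to every key, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)


def random_mask(num_patches, ratio, rng):
    """Pick round(num_patches * ratio) patches to hide (True = masked)."""
    num_masked = int(round(num_patches * ratio))
    hidden = set(rng.sample(range(num_patches), num_masked))
    return [i in hidden for i in range(num_patches)]


def masked_mse(target, prediction, mask):
    """Mean squared reconstruction error over masked patches only."""
    err, count = 0.0, 0
    for t_patch, p_patch, is_masked in zip(target, prediction, mask):
        if is_masked:
            for t, p in zip(t_patch, p_patch):
                err += (t - p) ** 2
                count += 1
    return err / count if count else 0.0


def combined_loss(queries, keys, target, prediction, mask,
                  alpha=1.0, temperature=0.2):
    """Assumed weighted sum of the 'reading' (contrastive) and
    'writing' (masked reconstruction) objectives."""
    contrastive = info_nce(queries, keys, temperature)
    reconstruction = masked_mse(target, prediction, mask)
    return contrastive + alpha * reconstruction
```

In this sketch, `queries` and `keys` stand for embeddings of two augmented views of the same text images, while `target` and `prediction` stand for the original and reconstructed pixel patches. Restricting the reconstruction error to masked patches is a common choice in masked image modeling, and it is assumed here rather than taken from the abstract.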