Scene text erasing aims to remove text content from scene images, and current state-of-the-art text erasing models are trained on large-scale synthetic data. Although data synthesis engines can provide vast amounts of annotated training samples, synthetic data still differ from real-world data. In this paper, we employ self-supervision for feature representation on unlabeled real-world scene text images. A novel pretext task is designed to enforce consistency among the text stroke masks of image variants. To remove residual text, we design the Progressive Erasing Network, which erases scene text progressively: intermediate results provide the foundation for subsequent, higher-quality results. Experiments show that our method significantly improves the generalization of text erasing and achieves state-of-the-art performance on public benchmarks.
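The consistency idea behind the pretext task can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not specify the image variants, the mask predictor, or the loss, so the code below assumes a horizontal flip as the only variant, a thresholding function as a hypothetical stand-in for a learned stroke-mask predictor, and a mean absolute difference as the consistency measure. All function names are illustrative.

```python
import numpy as np

def toy_stroke_mask(image, threshold=0.5):
    """Hypothetical stand-in for a learned predictor: dark pixels count as strokes."""
    return (image < threshold).astype(np.float32)

def left_biased_mask(image):
    """Deliberately inconsistent predictor that ignores the right half of the image."""
    mask = toy_stroke_mask(image)
    mask[:, image.shape[1] // 2:] = 0.0
    return mask

def hflip(x):
    """Assumed image variant: horizontal flip of an (H, W) array."""
    return x[:, ::-1]

def consistency_loss(predict_mask, image, transform):
    """Mean absolute gap between the mask of the variant and the variant of the mask."""
    mask_of_variant = predict_mask(transform(image))
    variant_of_mask = transform(predict_mask(image))
    return float(np.mean(np.abs(mask_of_variant - variant_of_mask)))

# A pixel-wise predictor commutes with the flip, so its loss is zero.
rng = np.random.default_rng(0)
random_image = rng.random((8, 8))
loss_consistent = consistency_loss(toy_stroke_mask, random_image, hflip)

# On an all-dark image the left-biased predictor disagrees with itself everywhere.
dark_image = np.zeros((8, 8))
loss_inconsistent = consistency_loss(left_biased_mask, dark_image, hflip)
```

Under these assumptions no annotation is needed: the loss depends only on the predictor's outputs for two views of the same unlabeled image, which is what makes a signal of this kind usable on real-world data.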