Training accurate scene text recognition models requires a large number of annotated training images. However, collecting enough data is labor-intensive and costly, particularly for low-resource languages. Automatically generating text images has shown promise in alleviating this problem. Unfortunately, existing scene text generation methods typically rely on large amounts of paired data, which are difficult to obtain for low-resource languages. In this paper, we propose a novel weakly supervised scene text generation method that uses a small number of recognition-level labels as weak supervision. Through cross-language generation, the proposed method can produce a large number of scene text images with diverse backgrounds and font styles. Our method disentangles the content and style features of scene text images: the former represent textual information, and the latter represent characteristics such as font, alignment, and background. To preserve the complete content structure of generated images, we introduce an integrated attention module. Furthermore, to bridge the style gap between languages, we incorporate a pre-trained font classifier. We evaluate our method using state-of-the-art scene text recognition models. Experiments demonstrate that our generated scene text significantly improves scene text recognition accuracy and helps achieve even higher accuracy when combined with other generative methods.