Scene text recognition in low-resource languages often suffers from the scarcity of training data captured in real-world scenes. This study proposes a novel approach that generates text images in low-resource languages by emulating the style of real text images from high-resource languages. Our approach uses a diffusion model conditioned on two binary states, ``synthetic'' and ``real.'' The model is trained on dual translation tasks, transforming plain text images into either synthetic or real text images according to the binary state. This design not only cleanly separates the two domains but also encourages the model to explicitly recognize the characters of the target language. Furthermore, to improve both the accuracy and the variety of the generated text images, we introduce two guidance techniques: Fidelity-Diversity Balancing Guidance and Fidelity Enhancement Guidance. Our experiments demonstrate that text images generated by the proposed framework significantly improve the performance of scene text recognition models for low-resource languages.
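To make the binary-state conditioning and the two guidance techniques more concrete, below is a minimal sketch of how domain-conditioned noise predictions might be combined at sampling time, in the spirit of classifier-free guidance. The `denoiser` interface, the `state` flag, and the weights `w_balance` and `w_enhance` are illustrative assumptions for this sketch, not the paper's actual formulation.

```python
import torch

def guided_noise_prediction(denoiser, x_t, t, text_cond,
                            w_balance=0.5, w_enhance=1.5):
    # Hypothetical sketch (not the paper's exact method): combine noise
    # predictions conditioned on the two binary domain states.
    eps_real = denoiser(x_t, t, text_cond, state="real")        # "real" state
    eps_synth = denoiser(x_t, t, text_cond, state="synthetic")  # "synthetic" state

    # Fidelity-Diversity Balancing (assumed form): interpolate between the
    # two domain-conditioned predictions; w_balance trades real-style
    # fidelity against the variety of the synthetic domain.
    eps_balanced = (1.0 - w_balance) * eps_real + w_balance * eps_synth

    # Fidelity Enhancement (assumed form): extrapolate along the
    # real-minus-synthetic direction to strengthen real-domain fidelity.
    return eps_balanced + w_enhance * (eps_real - eps_synth)


# Toy usage with a stand-in denoiser that returns random tensors.
def dummy_denoiser(x_t, t, cond, state):
    return torch.randn_like(x_t)

x_t = torch.randn(1, 3, 32, 128)   # noisy text-image tensor
t = torch.tensor([100])            # diffusion timestep
eps = guided_noise_prediction(dummy_denoiser, x_t, t, text_cond=None)
print(eps.shape)                   # torch.Size([1, 3, 32, 128])
```

Under these assumptions, `w_balance` would control the diversity side of the trade-off while `w_enhance` would push samples toward the real domain; the actual weighting scheme is defined in the paper itself.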