Scene text image super-resolution (STISR) has recently achieved great success as a preprocessing method for scene text recognition. STISR aims to transform blurred, noisy low-resolution (LR) text images captured in real-world settings into clear high-resolution (HR) text images suitable for scene text recognition. In this study, we apply text-conditional diffusion models (DMs), known for their impressive text-to-image synthesis capabilities, to STISR. Our experiments show that text-conditional DMs notably surpass existing STISR methods. In particular, when the text in an LR text image is given as input, text-conditional DMs produce super-resolved text images of superior quality. Building on this capability, we propose a novel framework for synthesizing LR-HR paired text image datasets. The framework consists of three specialized text-conditional DMs, dedicated respectively to text image synthesis, super-resolution, and image degradation. These three modules are vital for synthesizing distinct LR and HR paired images that are better suited to training STISR methods. Our experiments confirm that these synthesized image pairs significantly improve the performance of STISR methods on the TextZoom evaluation.