Scene Text Image Super-Resolution (STISR) aims to enhance the resolution and legibility of text in low-resolution (LR) images, thereby improving recognition accuracy in Scene Text Recognition (STR). Previous methods predominantly employ discriminative Convolutional Neural Networks (CNNs) augmented with various forms of text guidance. However, they fall short on severely blurred images, because their generative capability is insufficient when little structural or semantic information can be extracted from the original image. We therefore introduce RGDiffSR, a Recognition-Guided Diffusion model for scene text image Super-Resolution, which exhibits high generative diversity and fidelity even in challenging scenarios. We further propose a Recognition-Guided Denoising Network that guides the diffusion model toward generating LR-consistent results through succinct semantic guidance. Experiments on the TextZoom dataset demonstrate the superiority of RGDiffSR over prior state-of-the-art methods in both text recognition accuracy and image fidelity.