HiREN: Towards Higher Supervision Quality for Better Scene Text Image Super-Resolution

Scene text image super-resolution (STISR) is an important pre-processing technique for text recognition from low-resolution scene images. Various methods have been proposed to extract text-specific information from high-resolution (HR) images to supervise STISR model training. However, because of uncontrollable factors in manually photographing HR images (e.g., shooting equipment, focus, and environment), the quality of HR images cannot be guaranteed, which unavoidably degrades STISR performance. Motivated by this quality issue, in this paper we propose a novel idea to boost STISR: first enhance the quality of the HR images, then use the enhanced HR images as supervision for STISR. Concretely, we develop a new STISR framework, called High-Resolution ENhancement (HiREN), that consists of two branches and a quality estimation module. The first branch recovers the low-resolution (LR) images, and the second is an HR quality enhancement branch that generates high-quality (HQ) text images from the HR images to provide more accurate supervision for the LR images. As the degradation from HQ to HR may be diverse, and there is no pixel-level supervision for HQ image generation, we design a kernel-guided enhancement network to handle various degradations, and exploit the feedback from a recognizer together with text-level annotations as weak supervision signals to train the HR enhancement branch. Then, a quality estimation module evaluates the quality of the HQ images, and the resulting quality scores are used to suppress erroneous supervision information by weighting the loss of each image. Extensive experiments on TextZoom show that HiREN works well with most existing STISR methods and significantly boosts their performance.

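The loss-weighting idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how quality is computed or how the weights enter the loss, so everything below is an assumption made for illustration: the quality proxy (one minus the normalized edit distance between a recognizer's output on the HQ image and the text label), the per-image mean squared error, the normalization by the sum of weights, and all function names are hypothetical and are not taken from HiREN itself.

```python
from typing import List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def quality_score(recognized: str, label: str) -> float:
    """Hypothetical quality proxy in [0, 1]: 1 minus normalized edit distance.

    `recognized` stands for a recognizer's output on an HQ image and
    `label` for the text-level annotation (both assumed inputs).
    """
    longest = max(len(recognized), len(label))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(recognized, label) / longest


def mse(sr: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two flattened images of equal size."""
    if len(sr) != len(target):
        raise ValueError("images must have the same number of pixels")
    return sum((s - t) ** 2 for s, t in zip(sr, target)) / len(sr)


def quality_weighted_loss(
    sr_images: List[Sequence[float]],
    hq_images: List[Sequence[float]],
    weights: Sequence[float],
) -> float:
    """Weight each image's loss by the quality of its HQ supervision target.

    Low-quality HQ targets contribute less, which suppresses erroneous
    supervision. Dividing by the weight sum is an assumed normalization.
    """
    per_image = [mse(s, h) for s, h in zip(sr_images, hq_images)]
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    return sum(w * l for w, l in zip(weights, per_image)) / total_weight
```

In this sketch, an HQ image whose recognized text matches the label exactly receives weight 1 and contributes its full loss, while an HQ image whose recognized text shares nothing with the label receives weight 0 and is effectively ignored during training.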