In this paper, we introduce a new perspective for improving image restoration: removing degradation in the textual representations of a given degraded image. Intuitively, restoration is much easier in the text modality than in the image modality. For example, it can be conducted simply by removing degradation-related words while keeping the content-aware words. Hence, we combine the advantage of images in describing details with the advantage of text in removing degradation to perform restoration. To enable this cross-modal assistance, we propose to map degraded images into textual representations, remove the degradations there, and then convert the restored textual representations into a guidance image that assists image restoration. In particular, we embed an image-to-text mapper and a text restoration module into CLIP-equipped text-to-image models to generate the guidance. Then, we adopt a simple coarse-to-fine approach to dynamically inject multi-scale information from the guidance into image restoration networks. Extensive experiments are conducted on various image restoration tasks, including deblurring, dehazing, deraining, denoising, and all-in-one image restoration. The results show that our method outperforms state-of-the-art methods across all these tasks. The code and models are available at \url{https://github.com/mrluin/TextualDegRemoval}.
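The intuition that restoration is easier in the text modality can be illustrated with a toy sketch. It works on a plain caption string, whereas the method described above operates on textual representations inside a CLIP-equipped text-to-image model, so this is only an analogy and not the paper's implementation. The word list `DEGRADATION_WORDS` and the function `remove_degradation_words` are hypothetical names introduced here for illustration.

```python
# Toy illustration only: the vocabulary below is an assumption, not taken from the paper.
DEGRADATION_WORDS = {"blurry", "blurred", "hazy", "foggy", "rainy", "noisy", "grainy"}


def remove_degradation_words(caption: str) -> str:
    """Drop degradation-related words from a caption and keep the content words."""
    kept = [
        word
        for word in caption.split()
        # Compare case-insensitively and ignore trailing punctuation.
        if word.lower().strip(".,") not in DEGRADATION_WORDS
    ]
    return " ".join(kept)


restored_caption = remove_degradation_words("a blurry photo of a dog on a rainy street")
print(restored_caption)
```

In this analogy, the cleaned caption plays the role of the restored textual representation, which would then be passed to a text-to-image model to produce the guidance image.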