Vision-language models such as CLIP have had a strong impact on diverse downstream tasks involving zero-shot or label-free prediction. However, on low-level vision tasks such as image restoration, their performance deteriorates dramatically because the inputs are corrupted. In this paper, we present a degradation-aware vision-language model (DA-CLIP) that better transfers pretrained vision-language models to low-level vision, serving as a multi-task framework for image restoration. More specifically, DA-CLIP trains an additional controller that adapts the fixed CLIP image encoder to predict high-quality feature embeddings. By integrating this embedding into an image restoration network via cross-attention, we can guide the model toward high-fidelity image reconstruction. The controller also outputs a degradation feature that matches the actual corruptions of the input, yielding a natural classifier for different degradation types. In addition, we construct a mixed-degradation dataset with synthetic captions for DA-CLIP training. Our approach advances state-of-the-art performance on both \emph{degradation-specific} and \emph{unified} image restoration tasks, showing a promising direction for prompting image restoration with large-scale pretrained vision-language models. Our code is available at https://github.com/Algolzw/daclip-uir.
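To illustrate how a degradation feature can act as a "natural classifier", the following minimal sketch matches a degradation embedding against text embeddings of degradation names by cosine similarity, in the usual CLIP zero-shot style. It is not taken from the released code: the function names, the toy three-dimensional vectors, the degradation labels, and the temperature value are all hypothetical and chosen only for illustration. In practice the embeddings would come from the controller and the CLIP text encoder.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_degradation(degradation_embedding, text_embeddings, temperature=0.01):
    """Return the best-matching degradation name and per-class probabilities.

    degradation_embedding: list of floats (assumed output of the controller).
    text_embeddings: dict mapping a degradation name to its text embedding.
    temperature: assumed softmax temperature; not a value from the paper.
    """
    query = l2_normalize(degradation_embedding)
    names = list(text_embeddings)
    logits = []
    for name in names:
        key = l2_normalize(text_embeddings[name])
        cosine = sum(q * k for q, k in zip(query, key))
        logits.append(cosine / temperature)
    probabilities = softmax(logits)
    best_index = max(range(len(names)), key=probabilities.__getitem__)
    return names[best_index], dict(zip(names, probabilities))


# Toy example with hypothetical embeddings (real ones are high-dimensional).
toy_text_embeddings = {
    "hazy": [1.0, 0.0, 0.0],
    "rainy": [0.0, 1.0, 0.0],
    "noisy": [0.0, 0.0, 1.0],
}
label, probs = classify_degradation([0.1, 0.9, 0.2], toy_text_embeddings)
print(label)
```

Normalizing both sides before the dot product keeps the score independent of embedding magnitude, which is why the comparison reduces to picking the text prompt with the highest cosine similarity.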