Image denoising is a fundamental task in computer vision. While prevailing deep-learning-based supervised and self-supervised methods excel at eliminating in-distribution noise, their susceptibility to out-of-distribution (OOD) noise remains a significant challenge. The recent emergence of the contrastive language-image pre-training (CLIP) model has showcased exceptional capabilities in open-world image recognition and segmentation, yet the potential of CLIP for enhancing the robustness of low-level vision tasks remains largely unexplored. This paper reveals that certain dense features extracted from the frozen ResNet image encoder of CLIP are distortion-invariant and content-related, two properties that are highly desirable for generalizable denoising. Leveraging these properties, we devise an asymmetrical encoder-decoder denoising network that feeds dense features, namely the noisy image together with its multi-scale features from the frozen ResNet encoder of CLIP, into a learnable image decoder to achieve generalizable denoising. A progressive feature augmentation strategy is further proposed to mitigate feature overfitting and improve the robustness of the learnable decoder. Extensive experiments and comparisons across diverse OOD noise types, including synthetic noise, real-world sRGB noise, and low-dose CT image noise, demonstrate the superior generalization ability of our method.
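The asymmetric design described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the stand-in `frozen_encoder` merely mimics a frozen multi-scale feature extractor (the real one would be CLIP's frozen ResNet image encoder), and `progressive_feature_augmentation` is a hypothetical instantiation of the augmentation idea in which deeper feature levels receive stronger perturbations so that the learnable decoder does not overfit to any single feature scale.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_encoder(img):
    # Stand-in for CLIP's frozen ResNet image encoder: returns
    # multi-scale dense features. Shapes are hypothetical; here we
    # simply downsample by 2 at each stage. No weights are updated,
    # mirroring the frozen encoder of the asymmetric network.
    feats = []
    x = img
    for _ in range(3):
        x = x[:, ::2, ::2]  # one "stage": halve spatial resolution
        feats.append(x)
    return feats

def progressive_feature_augmentation(feats, base_sigma=0.1):
    # Hypothetical progressive augmentation: perturbation strength
    # grows with feature depth, encouraging the trainable decoder
    # to stay robust rather than overfit to exact feature values.
    return [f + rng.normal(0.0, base_sigma * (i + 1), f.shape)
            for i, f in enumerate(feats)]

# Toy noisy input: 3-channel, 64x64 image.
noisy = rng.random((3, 64, 64))
feats = frozen_encoder(noisy)
aug_feats = progressive_feature_augmentation(feats)

# A learnable decoder (omitted here) would fuse the noisy image with
# these multi-scale features; only the decoder is trained.
print([f.shape for f in aug_feats])
```

In this sketch only the decoder side would carry trainable parameters, which is the asymmetry the abstract refers to: the encoder stays frozen so its distortion-invariant features are preserved across noise distributions.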