Self-Supervised Image Denoising for Real-World Images with Context-aware Transformer

In recent years, deep learning has pushed image denoising to a new level. Among deep learning approaches, self-supervised denoising is increasingly popular because it does not require any prior knowledge. Most existing self-supervised methods are based on convolutional neural networks (CNNs), which are restricted by the locality of the receptive field and can cause color shifts or texture loss. In this paper, we propose a novel Denoise Transformer for real-world image denoising, which is mainly constructed from Context-aware Denoise Transformer (CADT) units and a Secondary Noise Extractor (SNE) block. CADT is designed as a dual-branch structure: the global branch uses a window-based Transformer encoder to extract global information, while the local branch focuses on extracting local features with a small receptive field. Using CADT as the basic component, we build a hierarchical network that directly learns the noise distribution through residual learning and produces the first-stage denoised output. We then design a low-computation SNE for secondary global noise extraction. Finally, the blind spots are collected from the Denoise Transformer output and reconstructed to form the final denoised image. In extensive experiments on the real-world SIDD benchmark, our method achieves 50.62/0.990 in PSNR/SSIM, which is competitive with the current state-of-the-art method and only 0.17/0.001 lower. Visual comparisons on public sRGB, Raw-RGB and greyscale datasets show that the proposed Denoise Transformer performs competitively, especially on blurred textures and low-light images, without using additional knowledge about the underlying unknown noise, e.g., noise level or noise type.
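
The last two steps of the pipeline, residual learning followed by blind-spot collection, can be illustrated with a minimal sketch. It is not the Denoise Transformer itself: `estimate_noise` is a hypothetical stand-in for the network (a simple 4-neighbour rule rather than a learned model), and the regular grid mask with `stride=2`, the zero fill value, and the function names are assumptions made for illustration, since the abstract does not specify the masking scheme.

```python
from typing import Callable, List

Image = List[List[float]]


def estimate_noise(masked: Image) -> Image:
    """Hypothetical stand-in for the denoising network (an assumption).

    Predicts the noise at each pixel as (pixel - mean of its 4-neighbours).
    Assumes the image has at least two pixels so every pixel has a neighbour.
    """
    h, w = len(masked), len(masked[0])
    noise = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            nbrs = [masked[y][x]
                    for y, x in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                    if 0 <= y < h and 0 <= x < w]
            noise[i][j] = masked[i][j] - sum(nbrs) / len(nbrs)
    return noise


def blind_spot_denoise(noisy: Image,
                       noise_model: Callable[[Image], Image] = estimate_noise,
                       stride: int = 2) -> Image:
    """Denoise with grid-shaped blind spots and residual learning."""
    h, w = len(noisy), len(noisy[0])
    result = [[0.0] * w for _ in range(h)]
    for a in range(stride):
        for b in range(stride):
            # Hide every pixel on this grid offset; these are the blind spots.
            masked = [[0.0 if (i % stride == a and j % stride == b) else noisy[i][j]
                       for j in range(w)] for i in range(h)]
            noise = noise_model(masked)
            # Residual learning: denoised = input - predicted noise.
            # Only the blind-spot positions of this pass are collected.
            for i in range(a, h, stride):
                for j in range(b, w, stride):
                    result[i][j] = masked[i][j] - noise[i][j]
    return result
```

Because each output pixel is taken from a pass in which that pixel was hidden, its value depends only on the surrounding context and never on its own noisy observation. Across the `stride * stride` passes every pixel is hidden exactly once, so the collected blind spots tile the full image.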
