Transformers have proven effective across a wide range of vision tasks, both high-level and low-level. Recently, masked autoencoders (MAE) for feature pre-training have further unleashed the potential of Transformers, leading to state-of-the-art performance on various high-level vision tasks. However, the value of MAE pre-training for low-level vision tasks has not been sufficiently explored. In this paper, we show that masked autoencoders are also scalable self-supervised learners for image processing tasks. We first present CSformer, an efficient Transformer model that combines channel attention with shifted-window-based self-attention. We then develop an effective MAE architecture for image processing tasks (MAEIP). Extensive experimental results show that, with MAEIP pre-training, the proposed CSformer achieves state-of-the-art performance on various image processing tasks, including Gaussian denoising, real image denoising, single-image motion deblurring, defocus deblurring, and image deraining.
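As a rough illustration of the masking step that MAE-style pre-training relies on, the sketch below splits an image's patch grid into visible and masked patch indices. It is a minimal, generic example and not the MAEIP design itself: the function name `random_patch_mask`, the 16-pixel patch size, and the 0.75 mask ratio are assumptions chosen for illustration (the abstract does not state the masking strategy or ratio used).

```python
import random


def random_patch_mask(height, width, patch_size, mask_ratio, seed=None):
    """Randomly split an image's patch grid into visible and masked indices.

    Hypothetical helper for illustration only; patches are indexed in
    row-major order over the (height // patch_size) x (width // patch_size) grid.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")

    num_patches = (height // patch_size) * (width // patch_size)
    num_masked = int(num_patches * mask_ratio)

    # Shuffle all patch indices, then hide the first num_masked of them.
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)

    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# A 64x64 image with 16x16 patches has 16 patches; a 0.75 ratio hides 12.
visible, masked = random_patch_mask(64, 64, patch_size=16, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 4 12
```

In a typical MAE setup, the encoder would see only the visible patches and a decoder would be trained to reconstruct the masked ones; how MAEIP adapts this to image processing is described in the paper itself rather than in this sketch.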