Learning A Coarse-to-Fine Diffusion Transformer for Image Restoration

Recent years have witnessed the remarkable performance of diffusion models in various vision tasks. However, for image restoration, which aims to recover clear images with sharper details from degraded observations, diffusion-based methods may fail to produce satisfactory results because of inaccurate noise estimation. Moreover, simply constraining the estimated noise cannot effectively capture complex degradation information, which in turn limits model capacity. To solve these problems, we propose a coarse-to-fine diffusion Transformer (C2F-DFT) for image restoration. Specifically, C2F-DFT contains diffusion self-attention (DFSA) and a diffusion feed-forward network (DFN) within a new coarse-to-fine training scheme. DFSA captures long-range diffusion dependencies, and DFN learns hierarchical diffusion representations, to facilitate better restoration. In the coarse training stage, C2F-DFT estimates noise and then generates the final clean image with a sampling algorithm. To further improve restoration quality, we propose a simple yet effective fine training scheme. It first uses the coarse-trained diffusion model with a fixed number of sampling steps to generate restoration results, and then constrains these results with the corresponding ground-truth images to optimize the model, remedying the unsatisfactory results caused by inaccurate noise estimation. Extensive experiments show that C2F-DFT significantly outperforms the diffusion-based restoration method IR-SDE and achieves competitive performance compared with Transformer-based state-of-the-art methods on three tasks: deraining, deblurring, and real denoising. The code is available at https://github.com/wlydlut/C2F-DFT.
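
To make the difference between the two training stages concrete, the following is a minimal sketch of the two objectives on a toy 1-D signal. It is not the authors' implementation. The abstract does not specify the noise schedule, the sampler, or the loss functions, so the linear beta schedule, the deterministic DDIM-style sampler, the mean-squared noise loss, and the L1 restoration loss are all assumptions, and every function name here (`make_alpha_bars`, `coarse_loss`, `sample`, `fine_loss`, `make_toy_predictor`) is hypothetical. The toy predictor stands in for a trained network and simply returns the exact noise plus a constant bias.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def q_sample(x0, eps, a_bar):
    """Forward diffusion: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    sa, sb = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    return [sa * x + sb * e for x, e in zip(x0, eps)]


def coarse_loss(predict_noise, clean, degraded, alpha_bars, rng):
    """Coarse stage: mean squared error between the true and the estimated noise."""
    t = rng.randrange(len(alpha_bars))
    eps = [rng.gauss(0.0, 1.0) for _ in clean]
    x_t = q_sample(clean, eps, alpha_bars[t])
    eps_hat = predict_noise(x_t, degraded, t)
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(clean)


def sample(predict_noise, degraded, alpha_bars, steps, rng):
    """Deterministic DDIM-style sampling (assumed) with a fixed number of steps (>= 2)."""
    last = len(alpha_bars) - 1
    # Evenly spaced timesteps, visited from the noisiest (last) down to 0.
    timesteps = [round(i * last / (steps - 1)) for i in range(steps)][::-1]
    x = [rng.gauss(0.0, 1.0) for _ in degraded]
    for i, t in enumerate(timesteps):
        a_t = alpha_bars[t]
        eps_hat = predict_noise(x, degraded, t)
        # Clean-signal estimate implied by the predicted noise.
        x0_hat = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
                  for xi, e in zip(x, eps_hat)]
        if i + 1 < len(timesteps):
            # Re-noise the estimate to the next (less noisy) timestep.
            x = q_sample(x0_hat, eps_hat, alpha_bars[timesteps[i + 1]])
        else:
            x = x0_hat
    return x


def fine_loss(predict_noise, clean, degraded, alpha_bars, steps, rng):
    """Fine stage: L1 distance (assumed) between the sampled result and ground truth."""
    restored = sample(predict_noise, degraded, alpha_bars, steps, rng)
    return sum(abs(r - c) for r, c in zip(restored, clean)) / len(clean)


def make_toy_predictor(clean, alpha_bars, bias=0.0):
    """Hypothetical stand-in for a trained network: exact noise plus a constant bias."""
    def predict_noise(x_t, degraded, t):
        sa, sb = math.sqrt(alpha_bars[t]), math.sqrt(1.0 - alpha_bars[t])
        return [(x - sa * c) / sb + bias for x, c in zip(x_t, clean)]
    return predict_noise


rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
clean = [0.2, 0.5, 0.8, 0.1]
degraded = [c + 0.3 for c in clean]  # toy degradation, ignored by the toy predictor
biased = make_toy_predictor(clean, alpha_bars, bias=0.05)
print("coarse loss:", coarse_loss(biased, clean, degraded, alpha_bars, rng))
print("fine loss:  ", fine_loss(biased, clean, degraded, alpha_bars, steps=10, rng=rng))
```

The coarse objective only scores the noise estimate at one random timestep, whereas the fine objective scores the image produced by the whole fixed-step sampling chain against the ground truth. In an actual training loop the fine loss would be used to update the network parameters; that update step is omitted here because the sketch uses plain Python lists without automatic differentiation.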
