Learning A Coarse-to-Fine Diffusion Transformer for Image Restoration

Diffusion models have recently achieved remarkable performance on a range of vision tasks. However, for image restoration, which aims to recover clear images with sharper details from degraded observations, diffusion-based methods may fail to produce satisfactory results because of inaccurate noise estimation. Moreover, simply constraining the estimated noise cannot effectively capture complex degradation information, which in turn limits model capacity. To address these problems, we propose a coarse-to-fine diffusion Transformer (C2F-DFT) for image restoration. Specifically, C2F-DFT combines diffusion self-attention (DFSA) and a diffusion feed-forward network (DFN) within a new coarse-to-fine training scheme. DFSA captures long-range diffusion dependencies, and DFN learns hierarchical diffusion representations; together they facilitate better restoration. In the coarse training stage, C2F-DFT estimates noise and then generates the final clean image with a sampling algorithm. To further improve restoration quality, we propose a simple yet effective fine training stage. It first uses the coarse-trained diffusion model to generate restoration results with a fixed number of sampling steps; these results are then constrained by the corresponding ground-truth images to optimize the model, remedying the unsatisfactory results caused by inaccurate noise estimation. Extensive experiments show that C2F-DFT significantly outperforms the diffusion-based restoration method IR-SDE and achieves competitive performance compared with state-of-the-art Transformer-based methods on three tasks: deraining, deblurring, and real denoising. The code is available at https://github.com/wlydlut/C2F-DFT.
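
To make the two training stages concrete, the following is a minimal sketch of how the coarse and fine objectives might be written. It is not the authors' implementation: the function names (`make_alpha_bars`, `add_noise`, `coarse_loss`, `sample`, `fine_loss`) and the `predict_noise` callable are hypothetical, and the linear noise schedule, the DDIM-style deterministic sampler, the MSE noise loss, and the L1 image loss are all assumptions that the abstract does not specify. Images are stood in for by plain lists of floats.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def add_noise(x0, eps, alpha_bar):
    """Forward diffusion: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    sa, sn = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [sa * x + sn * e for x, e in zip(x0, eps)]

def coarse_loss(predict_noise, clean, degraded, t, eps, alpha_bars):
    """Coarse stage (assumed MSE): penalize the error of the estimated noise."""
    x_t = add_noise(clean, eps, alpha_bars[t])
    eps_hat = predict_noise(x_t, degraded, t)
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(eps)

def sample(predict_noise, degraded, alpha_bars, num_sample_steps, rng):
    """Assumed DDIM-style deterministic sampling with a fixed number of steps."""
    total = len(alpha_bars)
    if not 1 <= num_sample_steps <= total:
        raise ValueError("num_sample_steps must be in [1, len(alpha_bars)]")
    stride = total // num_sample_steps
    timesteps = [total - 1 - i * stride for i in range(num_sample_steps)]
    x = [rng.gauss(0.0, 1.0) for _ in degraded]  # start from pure noise
    for i, t in enumerate(timesteps):
        a_t = alpha_bars[t]
        # After the last step there is no noise left, so alpha_bar is 1.
        a_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps_hat = predict_noise(x, degraded, t)
        x0_hat = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
                  for xi, e in zip(x, eps_hat)]
        x = [math.sqrt(a_prev) * x0 + math.sqrt(1.0 - a_prev) * e
             for x0, e in zip(x0_hat, eps_hat)]
    return x

def fine_loss(predict_noise, clean, degraded, alpha_bars, num_sample_steps, rng):
    """Fine stage (assumed L1): constrain the sampled result with the ground truth."""
    restored = sample(predict_noise, degraded, alpha_bars, num_sample_steps, rng)
    return sum(abs(r - c) for r, c in zip(restored, clean)) / len(clean)
```

The design point the sketch illustrates is that the coarse loss only supervises the noise estimate at a single timestep, whereas the fine loss supervises the image produced by the whole sampling chain, so errors that accumulate across steps are penalized directly. In a real training loop, `predict_noise` would be the trainable network, and the fine loss would be backpropagated through the sampling steps.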
