In this paper, we introduce YONOS-SR, a Stable Diffusion-based approach for image super-resolution (SR) that yields state-of-the-art results using only a single DDIM step. We propose a novel scale distillation approach to train our SR model. Instead of directly training our SR model on the scale factor of interest, we start by training a teacher model on a smaller magnification scale, thereby making the SR problem simpler for the teacher. We then train a student model for a higher magnification scale, using the predictions of the teacher as the target during training. This process is repeated iteratively until we reach the target scale factor of the final model. The rationale behind our scale distillation is that the teacher aids the training of the student diffusion model by (i) providing a target adapted to the current noise level, rather than using the same ground-truth target for all noise levels, and (ii) providing an accurate target, since the teacher has a simpler task to solve. We empirically show that the distilled model significantly outperforms the model trained directly for high scales, especially when few steps are used during inference. Having a strong diffusion model that requires only one step allows us to freeze the U-Net and fine-tune the decoder on top of it. We show that the combination of the spatially distilled U-Net and the fine-tuned decoder, using only a single step, outperforms state-of-the-art methods that require 200 steps.
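The scale distillation procedure described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the scale factors `[2, 4, 8]`, the names `distillation_schedule`, `training_target`, and `Denoiser`, and the choice to feed the teacher the low-resolution input at its own (smaller) scale are all hypothetical. Images are represented as plain lists of floats so the sketch stays self-contained.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Toy stand-in for an image or latent tensor (assumption for illustration).
Image = List[float]
# Hypothetical denoiser signature: (noisy_hr, lr_condition, t) -> predicted clean HR.
Denoiser = Callable[[Image, Image, int], Image]


def distillation_schedule(scales: Sequence[int]) -> List[Tuple[Optional[int], int]]:
    """Return (teacher_scale, student_scale) pairs, smallest scale first.

    The first model has no teacher (None) and is trained on ground truth;
    every later model is distilled from the model at the previous scale.
    """
    ordered = sorted(scales)
    pairs: List[Tuple[Optional[int], int]] = [(None, ordered[0])]
    for prev, cur in zip(ordered, ordered[1:]):
        pairs.append((prev, cur))
    return pairs


def training_target(
    hr: Image,
    noisy_hr: Image,
    lr_for_teacher: Image,
    t: int,
    teacher: Optional[Denoiser],
) -> Image:
    """Pick the regression target for one training example.

    Without a teacher, the target is the ground-truth HR image, which is
    the same for every noise level t. With a teacher, the target is the
    teacher's prediction from the same noisy HR input at the same t, so it
    adapts to the current noise level. The teacher is conditioned on the
    less-degraded LR input of its own scale (an assumption of this sketch).
    """
    if teacher is None:
        return hr
    return teacher(noisy_hr, lr_for_teacher, t)


def toy_teacher(noisy_hr: Image, lr: Image, t: int) -> Image:
    """Placeholder teacher: averages its two inputs; not a real model."""
    return [0.5 * (a + b) for a, b in zip(noisy_hr, lr)]
```

Calling `distillation_schedule([2, 4, 8])` yields one training stage per scale, with each stage's student serving as the teacher of the next. In a real training loop, the student at each stage would be conditioned on the more heavily downsampled input and regressed toward the value returned by `training_target`.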