Although recent research applying text-to-image (T2I) diffusion models to real-world super-resolution (SR) has achieved remarkable progress, the misalignment between the two tasks' objectives leads to a suboptimal trade-off between inference speed and detail fidelity. Specifically, the T2I task requires multiple inference steps to synthesize images matching the prompts, and it reduces the latent dimension to lower the generation difficulty. Conversely, SR can restore high-frequency details in fewer inference steps, but it requires a more reliable variational auto-encoder (VAE) to preserve input information. However, most diffusion-based SR models are multi-step and use 4-channel VAEs, while the existing models with 16-channel VAEs are oversized diffusion transformers, e.g., FLUX (12B). To align these targets, we present GenDR, a one-step diffusion model for generative detail restoration, distilled from a tailored diffusion model with a larger latent space. Specifically, we train a new SD2.1-VAE16 (0.9B) via representation alignment to expand the latent space without increasing the model size. For step distillation, we propose consistent score identity distillation (CiD), which incorporates an SR task-specific loss into score distillation to leverage more SR priors and align the training target. Furthermore, we extend CiD with adversarial learning and representation alignment (CiDA) to enhance perceptual quality and accelerate training. We also streamline the pipeline for more efficient inference. Experimental results demonstrate that GenDR achieves state-of-the-art performance in both quantitative metrics and visual fidelity.