Diffusion distillation is a highly promising direction for achieving faithful text-to-image generation in a few sampling steps. However, despite recent successes, existing distilled models still do not provide the full spectrum of diffusion abilities, such as real image inversion, which enables many precise image manipulation methods. This work aims to enrich distilled text-to-image diffusion models with the ability to effectively encode real images into their latent space. To this end, we introduce invertible Consistency Distillation (iCD), a generalized consistency distillation framework that facilitates both high-quality image synthesis and accurate image encoding in only 3-4 inference steps. Although high classifier-free guidance scales exacerbate the inversion problem for text-to-image diffusion models, we observe that dynamic guidance significantly reduces reconstruction errors without noticeable degradation in generation performance. As a result, we demonstrate that iCD equipped with dynamic guidance may serve as a highly effective tool for zero-shot text-guided image editing, competing with more expensive state-of-the-art alternatives.
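To make the idea of dynamic guidance concrete, the following is a minimal sketch of classifier-free guidance with a time-dependent scale. It is not the iCD implementation: the function names `dynamic_scale` and `guided_eps`, the threshold `t_off`, the value `w_max`, and the simple step-shaped schedule are all assumptions chosen for illustration. The sketch uses the convention in which a scale of 1 reduces to the plain conditional prediction, i.e. guidance is effectively switched off.

```python
import numpy as np


def dynamic_scale(t: float, w_max: float = 8.0, t_off: float = 0.7) -> float:
    """Hypothetical step schedule for the guidance scale.

    t is a normalized noise level in [0, 1], where 1 means pure noise.
    Assumption: guidance is disabled (scale 1.0) at high noise levels
    (t > t_off) and set to w_max otherwise.
    """
    return 1.0 if t > t_off else w_max


def guided_eps(eps_uncond: np.ndarray, eps_cond: np.ndarray, w: float) -> np.ndarray:
    """Standard classifier-free guidance combination of two noise predictions.

    With w == 1.0 the result equals the conditional prediction.
    """
    return eps_uncond + w * (eps_cond - eps_uncond)


if __name__ == "__main__":
    # Toy stand-ins for the unconditional and text-conditional predictions.
    eps_u = np.zeros(3)
    eps_c = np.ones(3)
    for t in (0.9, 0.5, 0.1):
        w = dynamic_scale(t)
        print(t, w, guided_eps(eps_u, eps_c, w))
```

In a real sampler the two predictions would come from the same network evaluated with and without the text prompt; only the scale `w` changes from step to step.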