Denoising diffusion models have shown outstanding performance in image editing. Existing works tend to use either image-guided methods, which provide a visual reference but lack control over semantic coherence, or text-guided methods, which remain faithful to the text guidance but sacrifice visual quality. To address this trade-off, we propose the Zero-shot Inversion Process (ZIP), a framework that injects a fusion of a generated visual reference and text guidance into the semantic latent space of a \textit{frozen} pre-trained diffusion model. Using only a tiny neural network, ZIP produces diverse content and attributes under the intuitive control of a text prompt. Moreover, ZIP shows remarkable robustness in both in-domain and out-of-domain attribute manipulation on real images. We perform detailed experiments on various benchmark datasets. Compared to state-of-the-art methods, ZIP produces images of equivalent quality while providing a realistic editing effect.