Precise and controllable image editing is a challenging task that has attracted significant attention. Recently, DragGAN introduced an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision. However, because this method is built on generative adversarial networks (GANs), its generality is upper-bounded by the capacity of the pre-trained GAN models. In this work, we extend this editing framework to diffusion models and propose DragDiffusion. By leveraging large-scale pre-trained diffusion models, we greatly improve the applicability of interactive point-based editing in real-world scenarios. While most existing diffusion-based image editing methods operate on text embeddings, DragDiffusion optimizes the diffusion latent to achieve precise spatial control. Although diffusion models generate images iteratively, we empirically show that optimizing the diffusion latent at a single step suffices to generate coherent results, enabling DragDiffusion to complete high-quality editing efficiently. Extensive experiments across a wide range of challenging cases (e.g., multiple objects, diverse object categories, and various styles) demonstrate the versatility and generality of DragDiffusion.