Accurate and controllable image editing is a challenging task that has attracted significant attention recently. Notably, DragGAN is an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision. However, because it relies on generative adversarial networks (GANs), its generality is limited by the capacity of pretrained GAN models. In this work, we extend this editing framework to diffusion models and propose a novel approach, DragDiffusion. By harnessing large-scale pretrained diffusion models, we greatly enhance the applicability of interactive point-based editing to both real and diffusion-generated images. Our approach optimizes the diffusion latents to achieve precise spatial control. The supervision signal for this optimization comes from the diffusion model's UNet features, which are known to contain rich semantic and geometric information. Moreover, we introduce two additional techniques, namely LoRA fine-tuning and latent-MasaCtrl, to further preserve the identity of the original image. Lastly, we present a challenging benchmark dataset called DragBench, the first benchmark to evaluate the performance of interactive point-based image editing methods. Experiments across a wide range of challenging cases (e.g., images with multiple objects, diverse object categories, and various styles) demonstrate the versatility and generality of DragDiffusion. Code: https://github.com/Yujun-Shi/DragDiffusion.
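To make the idea of optimizing latents under feature supervision concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes an identity "feature extractor" (features equal the latent), whereas the method described above takes features from the diffusion UNet. It also uses a squared-error loss and a single handle point for simplicity. The names `step_toward` and `motion_supervision` are hypothetical and chosen only for this illustration.

```python
# Toy sketch of drag-style latent optimization on a 2D grid.
# Assumptions (not from the paper): features == latent (identity extractor),
# squared-error loss, one handle point, no mask or identity regularization.


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def step_toward(handle, target):
    """Return the grid cell one step from `handle` in the direction of `target`."""
    return (handle[0] + sign(target[0] - handle[0]),
            handle[1] + sign(target[1] - handle[1]))


def motion_supervision(latent, handle, target, lr=0.5, iters=20):
    """Pull the feature one step ahead of the handle toward the handle's feature.

    Minimizes 0.5 * (F[next] - ref)**2 by gradient descent, where `ref` is the
    handle's feature held constant (a stop-gradient), so only the latent value
    at `next` is updated. The grid is modified in place and also returned.
    """
    nxt = step_toward(handle, target)
    ref = latent[handle[0]][handle[1]]  # constant reference feature
    for _ in range(iters):
        grad = latent[nxt[0]][nxt[1]] - ref  # d(loss) / d(latent[next])
        latent[nxt[0]][nxt[1]] -= lr * grad
    return latent


# Usage: a 3x3 "latent" with a single bright cell at the handle point.
latent = [[0.0, 0.0, 0.0],
          [1.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]
motion_supervision(latent, handle=(1, 0), target=(1, 2))
print(latent[1])  # the cell next to the handle now carries the handle's value
```

In a real pipeline the gradient would flow through the UNet back to the diffusion latent via automatic differentiation, and the handle point would be re-located after each update; both are omitted here to keep the sketch self-contained.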