Accuracy and speed are critical in image editing tasks. Pan et al. introduced a drag-based image editing framework that achieves pixel-level control using Generative Adversarial Networks (GANs). A flurry of subsequent studies enhanced this framework's generality by leveraging large-scale diffusion models. However, these methods often suffer from inordinately long processing times (exceeding 1 minute per edit) and low success rates. Addressing these issues head-on, we present LightningDrag, a rapid approach that enables high-quality drag-based image editing in roughly 1 second. Unlike most previous methods, we redefine drag-based editing as a conditional generation task, eliminating the need for time-consuming latent optimization or gradient-based guidance during inference. In addition, the design of our pipeline allows us to train our model on large-scale paired video frames, which contain rich motion information such as object translations, changing poses and orientations, and zooming in and out. By learning from videos, our approach significantly outperforms previous methods in terms of accuracy and consistency. Despite being trained solely on videos, our model generalizes well to local shape deformations not present in the training data (e.g., lengthening hair, twisting rainbows). Extensive qualitative and quantitative evaluations on benchmark datasets corroborate the superiority of our approach. The code and model will be released at https://github.com/magic-research/LightningDrag.