Text-guided diffusion models have revolutionized image generation and editing, offering exceptional realism and diversity. In diffusion-based editing, where a source image is edited according to a target prompt, the process begins by inverting the source image into a noisy latent vector with the diffusion model. This vector is then fed into separate source and target diffusion branches for editing. The accuracy of the inversion significantly affects the final result, influencing both how well the essential content of the source image is preserved and how faithfully the edit follows the target prompt. Prior inversion techniques sought a single solution shared by the source and target diffusion branches. However, our theoretical and empirical analyses reveal that disentangling the two branches cleanly separates their responsibilities: one preserves essential content, the other ensures edit fidelity. Building on this insight, we introduce "Direct Inversion," a novel technique that achieves optimal performance in both branches with just three lines of code. To assess image editing performance, we present PIE-Bench, an editing benchmark of 700 images covering diverse scenes and editing types, accompanied by versatile annotations and comprehensive evaluation metrics. Compared with state-of-the-art optimization-based inversion techniques, our solution not only yields superior performance across 8 editing methods but also achieves a speed-up of nearly an order of magnitude.
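The branch disentanglement described above can be illustrated with a toy sketch. It is one possible reading of the idea, not the authors' implementation: `toy_eps` is a hypothetical stand-in for a conditional noise predictor, the linear schedule in `alphas` and the scalar conditions are arbitrary assumptions, and the two branches are not coupled through attention as they would be in a real editing method. The sketch stores the latents produced during inversion and, at each editing step, adds the gap between the stored latent and the source-branch latent back to the source branch only, leaving the target branch untouched.

```python
import numpy as np

def toy_eps(z, t, cond):
    """Hypothetical stand-in for a conditional noise predictor (assumption)."""
    return np.tanh(0.5 * z + cond + 0.01 * t)

def ddim_step(z, eps, a_from, a_to):
    """Deterministic DDIM update between two cumulative-alpha levels."""
    x0 = (z - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0 + np.sqrt(1.0 - a_to) * eps

def invert(z0, alphas, cond_src):
    """DDIM inversion: walk from the clean latent to the noisy one, keeping every latent."""
    traj = [z0]
    for t in range(len(alphas) - 1):
        z = traj[-1]
        traj.append(ddim_step(z, toy_eps(z, t, cond_src), alphas[t], alphas[t + 1]))
    return traj

def edit(traj, alphas, cond_src, cond_tgt, rectify):
    """Run source and target branches from the same noisy latent."""
    z_src = z_tgt = traj[-1]
    for t in range(len(alphas) - 1, 0, -1):
        z_src = ddim_step(z_src, toy_eps(z_src, t, cond_src), alphas[t], alphas[t - 1])
        z_tgt = ddim_step(z_tgt, toy_eps(z_tgt, t, cond_tgt), alphas[t], alphas[t - 1])
        if rectify:
            # Source branch only: add back the gap to the stored inversion latent.
            offset = traj[t - 1] - z_src
            z_src = z_src + offset
    return z_src, z_tgt

# Arbitrary toy setup (all values are assumptions for illustration).
alphas = np.linspace(0.999, 0.1, 11)          # index 0 = clean, last = noisiest
z0 = np.random.default_rng(0).normal(size=4)  # pretend source latent
traj = invert(z0, alphas, cond_src=0.0)

src_rect, tgt_rect = edit(traj, alphas, 0.0, 1.0, rectify=True)
src_plain, tgt_plain = edit(traj, alphas, 0.0, 1.0, rectify=False)

err_rect = np.max(np.abs(src_rect - z0))
err_plain = np.max(np.abs(src_plain - z0))
print("source error with rectification:   ", err_rect)
print("source error without rectification:", err_plain)
```

In this toy setting the rectified source branch lands back on the stored inversion latents at every step, while the unrectified branch drifts away from the source latent. Because the toy branches are independent, the target result is identical in both runs; in a real editing pipeline the target branch typically depends on features from the source branch, which this sketch does not model.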