Diffusion inversion is the problem of taking an image and a text prompt that describes it and finding a noise latent that would generate the image. Most current inversion techniques operate by approximately solving an implicit equation and may converge slowly or yield poor reconstructions. Here, we formulate the problem as finding the roots of an implicit equation and design a method to solve it efficiently. Our solution is based on Newton-Raphson (NR), a well-known technique in numerical analysis. A naive application of NR may be computationally infeasible and tends to converge to incorrect solutions. We describe an efficient regularized formulation that converges quickly to a solution that provides high-quality reconstructions. We also identify a source of inconsistency stemming from prompt conditioning during the inversion process, which significantly degrades inversion quality. To address this, we introduce a prompt-aware adjustment of the encoding that corrects this issue. Our solution, Regularized Newton-Raphson Inversion, inverts an image within 0.5 seconds for latent consistency models, opening the door to interactive image editing. We further demonstrate improved results in image interpolation and generation of rare objects.
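To make the root-finding view concrete, the following is a minimal sketch on a toy scalar problem, not the method described in the abstract. It assumes a hypothetical implicit equation z = g(z), rewrites it as r(z) = z - g(z) = 0, and compares plain fixed-point iteration with Newton-Raphson. The function `g`, the constants `C` and `B`, and all helper names are illustrative assumptions; the regularization term and the prompt-aware adjustment are not modeled here.

```python
import math

# Hypothetical toy stand-in for one implicit inversion step: z = g(z).
# In a real diffusion model g would involve the denoiser network; here it is
# a scalar function chosen only so the example runs with the standard library.
C, B = 0.5, 0.9  # assumed constants, not taken from the paper


def g(z):
    return C + B * math.tanh(z)


def residual(z):
    # Root-finding form of the implicit equation: r(z) = z - g(z) = 0.
    return z - g(z)


def residual_grad(z):
    # Analytic derivative r'(z) = 1 - B * (1 - tanh(z)^2).
    return 1.0 - B * (1.0 - math.tanh(z) ** 2)


def fixed_point(z0, tol=1e-12, max_iter=200):
    # Repeatedly apply g until the residual is below tol.
    z = z0
    for k in range(max_iter):
        if abs(residual(z)) < tol:
            return z, k
        z = g(z)
    return z, max_iter


def newton_raphson(z0, tol=1e-12, max_iter=200):
    # Newton-Raphson update: z <- z - r(z) / r'(z).
    z = z0
    for k in range(max_iter):
        r = residual(z)
        if abs(r) < tol:
            return z, k
        z = z - r / residual_grad(z)
    return z, max_iter


z_fp, n_fp = fixed_point(0.5)
z_nr, n_nr = newton_raphson(0.5)
print(f"fixed point:    z = {z_fp:.10f} after {n_fp} iterations")
print(f"Newton-Raphson: z = {z_nr:.10f} after {n_nr} iterations")
```

On this toy problem both solvers reach the same root, and Newton-Raphson needs fewer iterations because it uses derivative information. In a high-dimensional latent space the full Jacobian would be expensive to form, which is consistent with the abstract's remark that a naive application of NR may be computationally infeasible.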