Consistent editing of real images is a challenging task, as it requires performing non-rigid edits (e.g., changing postures) to the main objects in the input image without changing their identity or attributes. To keep attributes consistent, some existing methods fine-tune the entire model or the textual embedding to preserve structure, but they are time-consuming and fail to perform non-rigid edits. Other methods are tuning-free, but their performance is limited by the quality of Denoising Diffusion Implicit Model (DDIM) reconstruction, which often fails in real-world scenarios. In this paper, we present a novel approach called Tuning-free Inversion-enhanced Control (TIC), which directly correlates features from the inversion process with those from the sampling process to mitigate the inconsistency in DDIM reconstruction. Specifically, our method obtains inversion features from the key and value features in the self-attention layers and uses them to enhance the sampling process, thereby achieving accurate reconstruction and content-consistent editing. To extend the applicability of our method to general editing scenarios, we also propose a mask-guided attention concatenation strategy that combines contents from both the inversion process and the naive DDIM editing process. Experiments show that the proposed method outperforms previous works in reconstruction and consistent editing, and produces strong results in various settings.
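The two mechanisms named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify which layers or timesteps are involved, how the mask is obtained, or how it enters the attention computation. The function names (`inversion_enhanced_attention`, `mask_guided_concat_attention`), the boolean `inv_mask` over inversion tokens, and the choice to apply the mask as an additive bias on attention scores are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v, bias=None):
    """Scaled dot-product attention. q: (n, d); k, v: (m, d); bias: (1, m) or (n, m)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if bias is not None:
        scores = scores + bias
    return softmax(scores) @ v


def inversion_enhanced_attention(q_samp, k_inv, v_inv):
    """Assumed form: sampling-time queries attend to keys/values cached during inversion."""
    return attention(q_samp, k_inv, v_inv)


def mask_guided_concat_attention(q_samp, k_inv, v_inv, k_edit, v_edit, inv_mask):
    """Assumed form: concatenate inversion and editing keys/values along the token axis.

    inv_mask is a hypothetical boolean vector of shape (m_inv,) marking which
    inversion tokens may be attended to; editing tokens are always visible.
    """
    k = np.concatenate([k_inv, k_edit], axis=0)
    v = np.concatenate([v_inv, v_edit], axis=0)
    inv_bias = np.where(inv_mask, 0.0, -np.inf)   # hide masked-out inversion tokens
    edit_bias = np.zeros(k_edit.shape[0])         # keep every editing token
    bias = np.concatenate([inv_bias, edit_bias])[None, :]
    return attention(q_samp, k, v, bias=bias)


# Toy tensors standing in for flattened self-attention features.
rng = np.random.default_rng(0)
n_tokens, dim = 6, 4
q_samp = rng.normal(size=(n_tokens, dim))
k_inv, v_inv = rng.normal(size=(n_tokens, dim)), rng.normal(size=(n_tokens, dim))
k_edit, v_edit = rng.normal(size=(n_tokens, dim)), rng.normal(size=(n_tokens, dim))
inv_mask = np.array([True, True, False, False, True, False])

recon_out = inversion_enhanced_attention(q_samp, k_inv, v_inv)
edit_out = mask_guided_concat_attention(q_samp, k_inv, v_inv, k_edit, v_edit, inv_mask)
```

In this sketch, setting `inv_mask` to all `False` reduces the second function to plain attention over the editing features, while all `True` lets the sampling queries draw on both processes equally.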