Text-guided non-rigid editing involves complex edits to input images, such as changing an object's motion or its composition within its surroundings. Because such edits require manipulating the structure of the input, existing methods often struggle to preserve object identity and background, particularly when built on Stable Diffusion. In this work, we propose a training-free approach for non-rigid editing with Stable Diffusion that improves identity preservation without compromising editability. Our approach comprises three stages: text optimization, latent inversion, and timestep-aware text injection sampling. Inspired by the recent success of Imagic, we adopt its text optimization for smooth editing. We then introduce latent inversion to preserve the input image's identity without additional model fine-tuning. To fully exploit the input reconstruction ability of latent inversion, we propose timestep-aware text injection sampling, which retains the structure of the input image by injecting the source text prompt in the early sampling steps and then transitioning to the target prompt in the subsequent steps. This strategy integrates seamlessly with text optimization, enabling complex non-rigid edits to the input without losing the original identity. Extensive experiments demonstrate the effectiveness of our method in terms of identity preservation, editability, and aesthetic quality.
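To make the timestep-aware text injection idea concrete, the sketch below builds a per-step conditioning schedule: the first sampling steps (the noisiest ones, where coarse structure is decided) use the source prompt, and the remaining steps use the target prompt. This is a minimal illustration under stated assumptions, not the authors' implementation. The function name `timestep_aware_prompt_schedule` and the parameter `injection_steps` are hypothetical, and a hard switch at a single step is assumed; the abstract only says the sampler transitions from the source to the target prompt, so the actual switching rule may differ.

```python
from typing import List, Sequence, TypeVar

# T can be a prompt string or a precomputed text embedding (e.g. a tensor).
T = TypeVar("T")


def timestep_aware_prompt_schedule(
    timesteps: Sequence[int],
    source_prompt: T,
    target_prompt: T,
    injection_steps: int,
) -> List[T]:
    """Return the conditioning to use at each sampling step (hypothetical helper).

    `timesteps` is ordered as the sampler visits them, noisiest first.
    The first `injection_steps` steps are conditioned on the source prompt
    so the early, structure-forming steps reproduce the input layout; every
    later step is conditioned on the target prompt to apply the edit.
    Assumes a hard switch rather than a gradual blend.
    """
    if not 0 <= injection_steps <= len(timesteps):
        raise ValueError("injection_steps must be in [0, len(timesteps)]")
    return [
        source_prompt if i < injection_steps else target_prompt
        for i in range(len(timesteps))
    ]


if __name__ == "__main__":
    # 50 descending timesteps: 999, 979, ..., 19 (a DDIM-style spacing).
    timesteps = list(range(999, -1, -20))
    schedule = timestep_aware_prompt_schedule(
        timesteps, "source", "target", injection_steps=10
    )
    for t, prompt in zip(timesteps[8:12], schedule[8:12]):
        print(t, prompt)
```

In a Stable Diffusion sampling loop that starts from the inverted latent, each schedule entry would be the text embedding passed to the denoiser at that step (for example, as the `encoder_hidden_states` argument of a diffusers `UNet2DConditionModel` call). Returning a list rather than switching inside the loop keeps the schedule easy to inspect and to replace with a gradual transition if one is preferred.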