Precise image editing with text-to-image models has attracted increasing interest due to their remarkable generative capabilities and user-friendly nature. However, such attempts face the pivotal challenge of misalignment between the intended precise editing target regions and the broader area impacted by the guidance in practice. Although excellent methods leveraging attention mechanisms have been developed to refine the editing guidance, these approaches require modifications to complex network architectures and are limited to specific editing tasks. In this work, we re-examine the diffusion process and the misalignment problem from a frequency perspective, revealing that, due to the power law of natural images and the decaying noise schedule, the denoising network primarily recovers low-frequency image components during the earlier timesteps and thus introduces excessive low-frequency signals into the editing guidance. Leveraging this insight, we introduce a novel fine-tuning-free approach that employs progressive $\textbf{Fre}$qu$\textbf{e}$ncy truncation to refine the guidance of $\textbf{Diff}$usion models for universal editing tasks ($\textbf{FreeDiff}$). Our method achieves results comparable to state-of-the-art methods across a variety of editing tasks and on a diverse set of images, highlighting its potential as a versatile tool in image editing applications.
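The core operation described above, truncating low-frequency components of the guidance and tightening the truncation progressively over timesteps, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact implementation: the abstract does not specify the filter shape or schedule, so the hard circular high-pass mask and the linear cutoff schedule (`progressive_cutoff`, a hypothetical helper) are assumptions for demonstration.

```python
import numpy as np

def high_pass_truncate(guidance, cutoff):
    """Zero out low-frequency components of a 2D guidance map.

    Applies a hard circular high-pass filter in the Fourier domain:
    frequency bins within `cutoff` of the DC component are discarded,
    so only high-frequency (detail-carrying) guidance survives.
    """
    f = np.fft.fftshift(np.fft.fft2(guidance))  # DC moved to the center
    h, w = guidance.shape
    cy, cx = h // 2, w // 2
    yy, xx = np.ogrid[:h, :w]
    keep = (yy - cy) ** 2 + (xx - cx) ** 2 >= cutoff ** 2  # keep high freqs
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * keep)))

def progressive_cutoff(t, num_steps, max_cutoff):
    """Assumed linear schedule: truncate aggressively at early (large-t)
    timesteps, where the network mainly recovers low frequencies, and
    relax the truncation as denoising proceeds."""
    return max_cutoff * t / num_steps
```

For example, a spatially constant guidance map carries only the DC (zero-frequency) component, so any nonzero cutoff removes it entirely, which is exactly the kind of broad low-frequency signal the abstract identifies as the source of unintended edits outside the target region.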