Text-guided image generation and editing using diffusion models have achieved remarkable advancements. Among these, tuning-free methods have gained attention for their ability to perform edits without extensive model adjustments, offering simplicity and efficiency. However, existing tuning-free approaches often struggle with balancing fidelity and editing precision. Reconstruction errors in DDIM Inversion are partly attributed to the cross-attention mechanism in U-Net, which introduces misalignments during the inversion and reconstruction process. To address this, we analyze reconstruction from a structural perspective and propose a novel approach that replaces traditional cross-attention with uniform attention maps, significantly enhancing image reconstruction fidelity. Our method effectively minimizes distortions caused by varying text conditions during noise prediction. To complement this improvement, we introduce an adaptive mask-guided editing technique that integrates seamlessly with our reconstruction approach, ensuring consistency and accuracy in editing tasks. Experimental results demonstrate that our approach not only excels in achieving high-fidelity image reconstruction but also performs robustly in real image composition and editing scenarios. This study underscores the potential of uniform attention maps to enhance the fidelity and versatility of diffusion-based image processing methods. Code is available at https://github.com/Mowenyii/Uniform-Attention-Maps.
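As a rough illustration of the core idea (a sketch under our own assumptions, not the paper's implementation), the snippet below contrasts standard scaled dot-product cross-attention, whose weights depend on the text keys, with a uniform attention map in which every query attends equally to all keys, making the aggregation independent of the text condition's alignment:

```python
import numpy as np

def cross_attention(q, k, v):
    """Standard scaled dot-product cross-attention.

    q: (Lq, D) image queries; k, v: (Lk, D) text keys/values.
    The softmax weights depend on query-key alignment, which varies
    with the text condition and can misalign inversion/reconstruction.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Numerically stable softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def uniform_attention(q, k, v):
    """Replace the learned attention map with a uniform one.

    Each query attends with equal weight 1/Lk to every key, so the
    output no longer depends on the text-conditioned alignment
    (it reduces to mean-pooling the values for every query).
    """
    lq, lk = q.shape[0], k.shape[0]
    weights = np.full((lq, lk), 1.0 / lk)
    return weights @ v
```

With a uniform map, each output row is simply the mean of the value vectors, which is why varying text prompts no longer perturb the noise prediction through the attention weights.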