Text-to-image (T2I) diffusion models, with their impressive generative capabilities, have been adopted for image editing tasks and have demonstrated remarkable efficacy. However, due to attention leakage and collision between the cross-attention map of the object and the new color attribute from the text prompt, text-guided image editing methods may fail to change an object's color, producing a misalignment between the edited image and the text prompt. In this paper, we conduct an in-depth analysis of the text-guided image synthesis process and of the semantic information learned by different cross-attention blocks. We observe that the visual representation of an object is determined in the up-block of the diffusion model during the early stage of the denoising process, and that color adjustment can be achieved by aligning the value matrices in the cross-attention layers. Based on these findings, we propose a straightforward yet stable and effective image-guided method that modifies the color of an object without any additional fine-tuning or training. Lastly, we present COLORBENCH, the first benchmark dataset for evaluating the performance of color-change methods. Extensive experiments validate the effectiveness of our method on object-level color editing, where it surpasses popular text-guided image editing approaches on both synthesized and real images.
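To make the value-alignment idea concrete, the sketch below illustrates how the value rows of a cross-attention layer could be swapped for those of a reference embedding carrying the target color, while the attention map (and hence the spatial layout of the object) is left untouched. This is a minimal PyTorch illustration of the mechanism described above, not the released implementation; the function name, tensor shapes, and token-mask scheme are all assumptions.

```python
import torch

def cross_attention_with_value_alignment(q, k, v_src, v_ref, token_mask):
    """Cross-attention where values at selected token positions are replaced.

    q:          (batch, n_pixels, dim) queries from the image latents
    k:          (batch, n_tokens, dim) keys from the source text embedding
    v_src:      (batch, n_tokens, dim) values from the source text embedding
    v_ref:      (batch, n_tokens, dim) values from a reference embedding
                carrying the target color attribute
    token_mask: (n_tokens,) bool, True at the object/color token positions
    """
    scale = q.shape[-1] ** -0.5
    # Standard scaled dot-product attention weights over the text tokens;
    # the attention map itself is not modified, so the layout is preserved.
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    # Value alignment: substitute reference values only at the masked tokens.
    v = torch.where(token_mask[None, :, None], v_ref, v_src)
    return attn @ v

# Toy usage with CLIP-like dimensions (77 text tokens).
b, p, t, d = 1, 64, 77, 320
q, k = torch.randn(b, p, d), torch.randn(b, t, d)
v_src, v_ref = torch.randn(b, t, d), torch.randn(b, t, d)
mask = torch.zeros(t, dtype=torch.bool)
mask[4] = True  # hypothetical position of the object token
out = cross_attention_with_value_alignment(q, k, v_src, v_ref, mask)
print(out.shape)  # torch.Size([1, 64, 320])
```

In a full pipeline, such a hook would presumably be attached to the up-block cross-attention layers during the early denoising steps, matching the stage at which, per the analysis above, the object's visual representation is determined.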