Diffusion models (DMs) have gained prominence due to their ability to generate high-quality, varied images, driven by recent advances in text-to-image generation. The research focus is now shifting toward the controllability of DMs. A significant challenge in this domain is localized editing, where specific areas of an image are modified without affecting the rest of the content. This paper introduces LIME for localized image editing in diffusion models. LIME does not require user-specified regions of interest (RoI) or additional text input; instead, it employs features from pre-trained methods and a straightforward clustering technique to obtain precise editing masks. Then, by leveraging cross-attention maps, it refines these segments to identify the regions to edit. Finally, we propose a novel cross-attention regularization technique that penalizes unrelated cross-attention scores in the RoI during the denoising steps, ensuring localized edits. Our approach, without re-training, fine-tuning, or additional user inputs, consistently improves the performance of existing methods on various editing benchmarks. The project page can be found at https://enisimsar.github.io/LIME/.
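The cross-attention regularization described above can be sketched as follows. This is a minimal, hypothetical illustration (not the paper's implementation): it assumes pre-softmax attention logits of shape (pixels, tokens), a boolean RoI mask, and the indices of the prompt tokens describing the edit, and it subtracts a penalty from unrelated tokens' logits inside the RoI before the softmax.

```python
import numpy as np

def regularize_cross_attention(attn_logits, roi_mask, edit_token_ids, alpha=10.0):
    """Hypothetical sketch of cross-attention regularization.

    attn_logits    : (num_pixels, num_tokens) pre-softmax attention scores
    roi_mask       : (num_pixels,) boolean, True inside the editing RoI
    edit_token_ids : indices of prompt tokens relevant to the edit
    alpha          : penalty subtracted from unrelated tokens inside the RoI
    """
    logits = attn_logits.copy()
    unrelated = np.ones(logits.shape[1], dtype=bool)
    unrelated[edit_token_ids] = False
    # Penalize unrelated tokens, but only at pixels inside the RoI,
    # so attention there concentrates on the edit-relevant tokens.
    logits[np.ix_(roi_mask, unrelated)] -= alpha
    # Standard softmax over the token axis.
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

With uniform logits, pixels inside the RoI attend almost exclusively to the edit tokens after regularization, while pixels outside the RoI keep their original (uniform) attention distribution, which is the intended localization effect.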