Diffusion-based Image Editing (DIE) is an emerging research hotspot that often applies a semantic mask to control the target area of diffusion-based editing. However, most existing solutions obtain these masks via manual operations or offline processing, which greatly reduces their efficiency. In this paper, we propose a novel and efficient image editing method for Text-to-Image (T2I) diffusion models, termed Instant Diffusion Editing (InstDiffEdit). In particular, InstDiffEdit exploits the cross-modal attention ability of existing diffusion models to achieve instant mask guidance during the diffusion steps. To reduce the noise of the attention maps and make the process fully automatic, we equip InstDiffEdit with a training-free refinement scheme that adaptively aggregates the attention distributions for automatic yet accurate mask generation. Meanwhile, to supplement the existing evaluations of DIE, we propose a new benchmark called Editing-Mask to examine the mask accuracy and local editing ability of existing methods. To validate InstDiffEdit, we also conduct extensive experiments on ImageNet and Imagen and compare it with a range of state-of-the-art (SOTA) methods. The experimental results show that InstDiffEdit not only outperforms the SOTA methods in both image quality and editing results, but also has a much faster inference speed, i.e., 5 to 6 times faster. Our code is available at https://anonymous.4open.science/r/InstDiffEdit-C306/
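To make the idea of attention-derived mask guidance concrete, the following is a minimal sketch and not the InstDiffEdit implementation. It assumes cross-attention probabilities are available as an array of shape (heads, H*W, tokens), and it replaces the adaptive refinement scheme with plain head-averaging and a fixed threshold. The names `attention_to_mask`, `masked_blend`, and the `threshold` parameter are hypothetical.

```python
import numpy as np


def attention_to_mask(attn, token_idx, threshold=0.5):
    """Turn cross-attention for one text token into a binary spatial mask.

    attn: array of shape (heads, H*W, tokens) with attention probabilities
          (assumed layout, not taken from the paper).
    token_idx: index of the text token that names the region to edit.
    threshold: fixed cut-off in [0, 1]; a simplification of the adaptive
               refinement described in the abstract.
    """
    heads, pixels, tokens = attn.shape
    side = int(round(np.sqrt(pixels)))
    if side * side != pixels:
        raise ValueError("expected a square spatial grid")

    # Average the chosen token's attention over all heads.
    token_map = attn[:, :, token_idx].mean(axis=0)

    # Min-max normalise so the threshold is comparable across steps.
    lo, hi = token_map.min(), token_map.max()
    norm = (token_map - lo) / (hi - lo + 1e-8)

    return (norm >= threshold).reshape(side, side)


def masked_blend(edited, original, mask):
    """Keep the edited latent inside the mask and the original outside it.

    edited, original: arrays of shape (C, H, W); mask: bool array (H, W).
    """
    return np.where(mask[None, :, :], edited, original)
```

In a diffusion loop, a mask like this would be recomputed from the current step's attention and applied to the latents at that step, which is what makes the guidance instant rather than precomputed offline.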