Recent diffusion-based image editing approaches have exhibited impressive editing capabilities in images with simple compositions. However, localized editing in complex scenarios has not been well studied in the literature, despite growing real-world demand. Existing mask-based inpainting methods fall short of retaining the underlying structure within the edit region. Meanwhile, mask-free attention-based methods often exhibit editing leakage and misalignment in more complex compositions. In this work, we develop MAG-Edit, a training-free, inference-stage optimization method, which enables localized image editing in complex scenarios. In particular, MAG-Edit optimizes the noise latent feature in diffusion models by maximizing two mask-based cross-attention constraints of the edit token, which in turn gradually enhances the local alignment with the desired prompt. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method in achieving both text alignment and structure preservation for localized editing within complex scenarios.
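To make the idea of optimizing a latent against mask-based cross-attention objectives more concrete, the following toy sketch may help. It is not the MAG-Edit implementation: the abstract does not define the two constraints, so `token_term` and `spatial_term` below are hypothetical stand-ins, and all function names, shapes, and hyperparameters are assumptions chosen for illustration. The sketch uses plain Python and finite-difference gradients on a tiny latent; a real diffusion pipeline would presumably rely on automatic differentiation over the denoiser's cross-attention maps.

```python
import math
import random

def cross_attention(latent, keys):
    """Softmax over tokens at each spatial position; returns attn[position][token]."""
    scale = math.sqrt(len(keys[0]))
    attn = []
    for z in latent:
        logits = [sum(a * b for a, b in zip(z, k)) / scale for k in keys]
        top = max(logits)
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        attn.append([e / total for e in exps])
    return attn

def masked_objective(latent, keys, mask, edit_idx):
    """Sum of two HYPOTHETICAL mask-based terms for the edit token (assumed forms)."""
    attn = cross_attention(latent, keys)
    inside = [attn[i][edit_idx] for i in range(len(attn)) if mask[i]]
    outside = [attn[i][edit_idx] for i in range(len(attn)) if not mask[i]]
    inside_mean = sum(inside) / len(inside)
    outside_mean = sum(outside) / len(outside)
    # Assumed term 1: share of attention the edit token wins over other tokens inside the mask.
    token_term = inside_mean
    # Assumed term 2: how much more the edit token attends inside the mask than outside it.
    spatial_term = inside_mean - outside_mean
    return token_term + spatial_term

def numerical_grad(f, latent, eps=1e-5):
    """Central-difference gradient of f with respect to every latent entry."""
    grad = [[0.0] * len(row) for row in latent]
    for i, row in enumerate(latent):
        for j in range(len(row)):
            old = row[j]
            row[j] = old + eps
            up = f(latent)
            row[j] = old - eps
            down = f(latent)
            row[j] = old  # restore the entry exactly
            grad[i][j] = (up - down) / (2 * eps)
    return grad

def optimize_latent(latent, keys, mask, edit_idx, steps=50, lr=0.1):
    """Gradient ascent on the masked objective; model weights (keys) stay fixed."""
    latent = [row[:] for row in latent]  # work on a copy, leave the input untouched
    objective = lambda z: masked_objective(z, keys, mask, edit_idx)
    for _ in range(steps):
        grad = numerical_grad(objective, latent)
        for i, row in enumerate(latent):
            for j in range(len(row)):
                row[j] += lr * grad[i][j]
    return latent

# Toy setup: 8 spatial positions, 4 latent channels, 3 prompt tokens (all assumed sizes).
random.seed(0)
latent = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
keys = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
mask = [True, True, True, False, False, False, False, False]  # edit region
edit_idx = 1  # index of the edit token in the prompt

before = masked_objective(latent, keys, mask, edit_idx)
optimized = optimize_latent(latent, keys, mask, edit_idx)
after = masked_objective(optimized, keys, mask, edit_idx)
print(f"objective before: {before:.4f}, after: {after:.4f}")
```

Only the latent is updated while the token keys stay fixed, which mirrors the training-free, inference-stage character described above; the mask restricts where the edit token's attention is rewarded.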