Precise image editing has attracted increasing attention, especially given the remarkable success of text-to-image generation models. To unify various spatial-aware image editing abilities in one framework, we adopt the concept of layers from the design domain so that objects can be manipulated flexibly with various operations. The key insight is to decompose the spatial-aware image editing task into two sub-tasks: multi-layered latent decomposition and multi-layered latent fusion. First, we segment the latent representations of the source images into multiple layers: several object layers and one incomplete background layer that requires reliable inpainting. To avoid extra tuning, we further explore the inpainting ability inherent in the self-attention mechanism. We introduce a key-masking self-attention scheme that propagates the surrounding context information into the masked region while mitigating its impact on the regions outside the mask. Second, we propose an instruction-guided latent fusion that pastes the multi-layered latent representations onto a canvas latent. We also introduce an artifact suppression scheme in the latent space to enhance the inpainting quality. Owing to the inherent modularity of such multi-layered representations, we achieve accurate image editing, and we demonstrate that our approach consistently surpasses the latest spatial editing methods, including Self-Guidance and DiffEditor. Finally, we show that our approach is a unified framework that supports more than six different accurate image editing tasks.
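The abstract does not give the exact formulation of key-masking self-attention, so the following is only a minimal sketch of one plausible reading: keys that fall inside the masked (hole) region are hidden from every query, so hole tokens are filled purely from surrounding context while tokens outside the mask are unaffected by whatever the hole contains. The function name `key_masked_self_attention`, the single-head setup, and the use of raw tokens as queries, keys, and values (identity projections) are all assumptions made for illustration and are not taken from the paper.

```python
import math


def key_masked_self_attention(queries, keys, values, hole_mask):
    """Single-head attention in which keys inside the hole are hidden.

    queries, keys, values: lists of n equal-length float vectors (one per token).
    hole_mask: list of n bools, True for tokens inside the region to inpaint.
    Returns (outputs, weights). Hypothetical helper, not the paper's code.
    """
    if all(hole_mask):
        raise ValueError("at least one key must lie outside the hole")
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        # Scaled dot-product scores, computed only for visible (non-hole) keys.
        scores = {
            j: scale * sum(a * b for a, b in zip(q, k))
            for j, k in enumerate(keys)
            if not hole_mask[j]
        }
        top = max(scores.values())  # subtract the max for numerical stability
        exps = {j: math.exp(s - top) for j, s in scores.items()}
        total = sum(exps.values())
        # Hidden keys receive exactly zero attention weight.
        row = [exps.get(j, 0.0) / total for j in range(len(keys))]
        weights.append(row)
        outputs.append(
            [sum(w * v[c] for w, v in zip(row, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


# Toy example: 4 tokens; token 2 is the hole left behind by a removed object.
# Identity projections (q = k = v = tokens) are an assumption for brevity.
tokens = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [1.0, 1.0]]
hole_mask = [False, False, True, False]
outputs, weights = key_masked_self_attention(tokens, tokens, tokens, hole_mask)
```

In this sketch the hole token still issues a query, so its output is a weighted mix of the visible tokens, while the outputs of the visible tokens do not depend on the hole's content at all. A real diffusion backbone would apply the same column masking inside each multi-head self-attention layer rather than on raw tokens.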