Object-aware Inversion and Reassembly for Image Editing

By comparing the original and target prompts in an editing task, we can obtain numerous editing pairs, each comprising an object and its corresponding editing target. To allow editability while maintaining fidelity to the input image, existing editing methods typically apply a fixed number of inversion steps that project the whole input image to a noisier latent representation, followed by a denoising process guided by the target prompt. However, we find that the optimal number of inversion steps for achieving ideal editing results varies significantly among editing pairs, because their editing difficulty differs. Methods that rely on a fixed number of inversion steps therefore produce sub-optimal generation quality, especially when handling multiple editing pairs in a natural image. To this end, we propose a new image editing paradigm, dubbed Object-aware Inversion and Reassembly (OIR), to enable object-level fine-grained editing. Specifically, we design a new search metric that determines the optimal number of inversion steps for each editing pair by jointly considering the editability of the target and the fidelity of the non-editing region. When editing an image, we use this metric to find the optimal inversion step for each editing pair, then edit the pairs separately to avoid concept mismatch. Finally, we apply an additional reassembly step that seamlessly integrates the individual editing results and the non-editing region to obtain the final edited image. To systematically evaluate the effectiveness of our method, we collect two datasets for benchmarking single- and multi-object editing, respectively. Experiments demonstrate that our method achieves superior performance in editing object shapes, colors, materials, categories, etc., especially in multi-object editing scenarios.
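The per-pair search and the reassembly step described above can be illustrated with a small sketch. The abstract does not state how editability and fidelity are measured, how they are combined, or how reassembly is carried out, so everything here is an assumption for illustration: the scores are taken as precomputed numbers (one per candidate inversion step), they are min-max normalized and summed, and reassembly is reduced to pasting regions with hard binary masks. The names `search_inversion_step` and `reassemble` are hypothetical.

```python
from typing import List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1] so the two criteria are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def search_inversion_step(editability: Sequence[float],
                          fidelity: Sequence[float]) -> int:
    """Return the index of the candidate inversion step with the best trade-off.

    editability[i]: how well the edited region matches the target when
        inverting for candidate step i (higher is better; assumed given).
    fidelity[i]: how well the non-editing region is preserved for the same
        candidate (higher is better; assumed given).
    The equal-weight sum below is an assumption, not the paper's formula.
    """
    if not editability or len(editability) != len(fidelity):
        raise ValueError("score lists must be non-empty and of equal length")
    combined = [e + f for e, f in zip(min_max(editability), min_max(fidelity))]
    return max(range(len(combined)), key=combined.__getitem__)


def reassemble(background: List[List[float]],
               edits: Sequence[List[List[float]]],
               masks: Sequence[List[List[int]]]) -> List[List[float]]:
    """Paste each separately edited region into the non-editing background.

    Each mask is 1 where its editing pair applies and 0 elsewhere. Hard
    pasting is a simplification of the "seamless" integration the text names.
    """
    out = [row[:] for row in background]
    for edited, mask in zip(edits, masks):
        for i, mask_row in enumerate(mask):
            for j, inside in enumerate(mask_row):
                if inside:
                    out[i][j] = edited[i][j]
    return out


# Toy scores for four candidate inversion steps of one editing pair.
best = search_inversion_step([0.1, 0.4, 0.8, 0.9], [0.9, 0.8, 0.6, 0.1])

# Two editing pairs, each edited separately, merged into one 2x2 "image".
merged = reassemble(
    [[0, 0], [0, 0]],
    [[[1, 1], [1, 1]], [[2, 2], [2, 2]]],
    [[[1, 0], [0, 0]], [[0, 0], [0, 1]]],
)
```

In this toy run, more inversion raises editability but lowers fidelity, and the search settles on the candidate where the normalized sum peaks. Running the search once per editing pair is what allows each object to receive its own number of inversion steps.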
