Diffusion models (DMs) trained on large-scale datasets can generate realistic images with text guidance. However, they offer limited control over the output space of the generated images. We propose a novel learning method for text-guided image editing, namely \texttt{iEdit}, that generates images conditioned on a source image and a textual edit prompt. Because no fully-annotated dataset with target images exists, previous approaches either perform subject-specific fine-tuning at test time or adopt contrastive learning without a target image, which makes it difficult to preserve the fidelity of the source image. We propose to automatically construct a dataset derived from LAION-5B that, given input image-caption pairs, contains pseudo-target images with their descriptive edit prompts. This dataset lets us introduce a weakly-supervised loss function that generates the pseudo-target image from the latent noise of the source image, conditioned on the edit prompt. To encourage localised editing and to preserve or modify spatial structures in the image, we propose a loss function that uses segmentation masks to guide the editing during training and, optionally, at inference. Our model is trained on the constructed dataset of 200K samples under constrained GPU resources. It compares favourably with its counterparts in image fidelity and CLIP alignment score, as well as qualitatively, when editing both generated and real images.
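To make the idea of a mask-guided loss concrete, the following is a minimal sketch of one possible form: a per-element weighted mean squared error in which elements inside the segmentation mask (the region to edit) are weighted differently from elements outside it (the region to preserve). This is an illustrative assumption, not the loss used by \texttt{iEdit}; the function name `masked_mse`, the weights `edit_weight` and `keep_weight`, and the use of flattened lists in place of latent tensors are all hypothetical.

```python
from typing import Sequence


def masked_mse(
    pred: Sequence[float],
    target: Sequence[float],
    mask: Sequence[float],
    edit_weight: float = 1.0,   # hypothetical weight inside the mask
    keep_weight: float = 0.1,   # hypothetical weight outside the mask
) -> float:
    """Weighted mean squared error over flattened predictions.

    mask[i] is 1.0 where editing is expected and 0.0 elsewhere.
    Illustrative only: not the loss defined in the paper.
    """
    if not (len(pred) == len(target) == len(mask)):
        raise ValueError("pred, target and mask must have the same length")
    if not pred:
        raise ValueError("inputs must be non-empty")

    total = 0.0
    weight_sum = 0.0
    for p, t, m in zip(pred, target, mask):
        # Interpolate between the two weights according to the mask value.
        w = edit_weight * m + keep_weight * (1.0 - m)
        total += w * (p - t) ** 2
        weight_sum += w

    # Normalise by the total weight so the scale does not depend on mask size.
    return total / weight_sum if weight_sum > 0 else 0.0
```

With `keep_weight=0.0`, errors outside the mask are ignored entirely, so the loss only constrains the masked region; a small positive `keep_weight` would still penalise deviations elsewhere, but less strongly.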