Large-scale text-to-image generative models have shown a remarkable ability to synthesize diverse, high-quality images. However, directly applying these models to edit real images remains challenging for two reasons. First, it is hard for users to write a text prompt that accurately describes every visual detail in the input image. Second, while existing models can introduce desirable changes in certain regions, they often dramatically alter the input content and introduce unexpected changes in regions that should stay untouched. In this work, we propose pix2pix-zero, an image-to-image translation method that preserves the content of the original image without manual prompting. We first automatically discover editing directions in the text embedding space that reflect the desired edits. To preserve the general content structure after editing, we further propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process. In addition, our method needs no additional training for these edits and can directly use an existing pre-trained text-to-image diffusion model. We conduct extensive experiments and show that our method outperforms existing and concurrent works on both real and synthetic image editing.
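The two ingredients named in the abstract, an editing direction in text embedding space and a penalty that keeps cross-attention maps close to those of the input image, can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes that an editing direction can be taken as the difference between the mean embeddings of two groups of sentences (source concept and target concept), and that the guidance term is a squared L2 distance between attention maps. The function names `edit_direction`, `attention_guidance_loss`, and `attention_guidance_grad` are hypothetical, and the arrays are toy stand-ins for real text-encoder embeddings and diffusion-model attention maps.

```python
import numpy as np


def edit_direction(source_embs, target_embs):
    """Assumed form of an editing direction: mean target embedding
    minus mean source embedding (each row is one sentence embedding)."""
    source_embs = np.asarray(source_embs, dtype=float)
    target_embs = np.asarray(target_embs, dtype=float)
    return target_embs.mean(axis=0) - source_embs.mean(axis=0)


def attention_guidance_loss(attn_edit, attn_ref):
    """Assumed guidance objective: squared L2 distance between the
    attention maps of the edited pass and those of the input image."""
    diff = np.asarray(attn_edit, dtype=float) - np.asarray(attn_ref, dtype=float)
    return float(np.sum(diff ** 2))


def attention_guidance_grad(attn_edit, attn_ref):
    """Gradient of the loss above with respect to attn_edit.
    A real pipeline would backpropagate to the noisy latent instead;
    treating the map itself as the variable is a simplification."""
    return 2.0 * (np.asarray(attn_edit, dtype=float) - np.asarray(attn_ref, dtype=float))


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Toy stand-ins for embeddings of sentences about the source and target concept.
    source = rng.normal(size=(8, 4))
    target = rng.normal(size=(8, 4)) + 1.0
    direction = edit_direction(source, target)

    # Shift a toy prompt embedding along the direction to request the edit.
    prompt_emb = rng.normal(size=4)
    edited_emb = prompt_emb + direction

    # One gradient step pulls the edited attention map toward the reference map.
    attn_ref = rng.random(size=(2, 3))
    attn_edit = rng.random(size=(2, 3))
    before = attention_guidance_loss(attn_edit, attn_ref)
    attn_edit = attn_edit - 0.25 * attention_guidance_grad(attn_edit, attn_ref)
    after = attention_guidance_loss(attn_edit, attn_ref)
    print(edited_emb.shape, before, after)
```

In this toy setting the gradient step acts directly on the attention map. In a diffusion model the same loss would be differentiated through the denoising network, so that the update changes the latent and only indirectly the attention maps.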