We present a novel algorithm for text-driven image-to-image translation based on a pretrained text-to-image diffusion model. Our method generates a target image by selectively editing the regions of interest in a source image, as specified by a modifying text, while preserving the remaining parts. In contrast to existing techniques that rely solely on a target prompt, we introduce a new score function that considers both a source prompt and a source image and is tailored to the specific translation task. To this end, we derive the conditional score function in a principled manner, decomposing it into a standard score and a guiding term for target image generation. To compute the gradient of the guiding term, we model the posterior distribution as a Gaussian and estimate its mean and variance without requiring additional training. In addition, to enhance the conditional score guidance, we incorporate a simple yet effective mixup method. This method combines two cross-attention maps derived from the source and target latents, so that the generated target image fuses the original parts of the source image with the edited regions aligned with the target prompt. Through comprehensive experiments, we demonstrate that our approach achieves outstanding image-to-image translation performance on various tasks.
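As a rough illustration of how a conditional score can split into a standard score and a guiding term, the following sketch applies Bayes' rule to a noisy target latent. The notation is an assumption made here for illustration and is not taken from the abstract: $x_t$ is the noisy target latent, $x_0^{\mathrm{src}}$ the source image, and $y^{\mathrm{src}}$, $y^{\mathrm{tgt}}$ the source and target prompts. The exact conditioning used by the method may differ.

```latex
\begin{align}
p\!\left(x_t \mid x_0^{\mathrm{src}}, y^{\mathrm{src}}, y^{\mathrm{tgt}}\right)
  &\propto p\!\left(x_t \mid y^{\mathrm{src}}, y^{\mathrm{tgt}}\right)\,
           p\!\left(x_0^{\mathrm{src}} \mid x_t, y^{\mathrm{src}}, y^{\mathrm{tgt}}\right), \\
\nabla_{x_t} \log p\!\left(x_t \mid x_0^{\mathrm{src}}, y^{\mathrm{src}}, y^{\mathrm{tgt}}\right)
  &= \underbrace{\nabla_{x_t} \log p\!\left(x_t \mid y^{\mathrm{src}}, y^{\mathrm{tgt}}\right)}_{\text{score without the source image}}
   + \underbrace{\nabla_{x_t} \log p\!\left(x_0^{\mathrm{src}} \mid x_t, y^{\mathrm{src}}, y^{\mathrm{tgt}}\right)}_{\text{guiding term}}.
\end{align}
```

The proportionality holds because the normalizing factor $p(x_0^{\mathrm{src}} \mid y^{\mathrm{src}}, y^{\mathrm{tgt}})$ does not depend on $x_t$, so it vanishes under the gradient. If the posterior in the guiding term is modeled as a Gaussian $\mathcal{N}(\mu(x_t), \Sigma(x_t))$, its log-density is available in closed form, which is one way the gradient could be evaluated without additional training.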
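The mixup of two cross-attention maps can be sketched as a convex combination. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `mixup_attention`, the scalar weight `lam`, and the `(num_pixels, num_tokens)` layout are all hypothetical, and the actual method may use a different weighting (for example, one that varies per token or per spatial location).

```python
import numpy as np


def mixup_attention(attn_src, attn_tgt, lam=0.5):
    """Blend two cross-attention maps with a convex combination.

    attn_src, attn_tgt: arrays of shape (num_pixels, num_tokens), assumed
        to be softmax outputs whose rows sum to 1 (hypothetical layout).
    lam: scalar weight in [0, 1] on the source map (an assumption; the
        original method may choose its weights differently).
    """
    attn_src = np.asarray(attn_src, dtype=float)
    attn_tgt = np.asarray(attn_tgt, dtype=float)
    if attn_src.shape != attn_tgt.shape:
        raise ValueError("attention maps must have the same shape")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # A convex combination of row-stochastic matrices stays row-stochastic,
    # so no renormalization is needed afterwards.
    return lam * attn_src + (1.0 - lam) * attn_tgt


def softmax(logits, axis=-1):
    """Numerically stable softmax, used here only to build toy inputs."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


# Toy example: 4 spatial positions attending over 3 prompt tokens.
rng = np.random.default_rng(0)
toy_src = softmax(rng.normal(size=(4, 3)))
toy_tgt = softmax(rng.normal(size=(4, 3)))
toy_mixed = mixup_attention(toy_src, toy_tgt, lam=0.3)
```

With `lam=1.0` the result is the source map and with `lam=0.0` it is the target map; intermediate values trade off preservation of the source layout against alignment with the target prompt.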