We introduce Delta Denoising Score (DDS), a novel scoring function for text-based image editing that guides minimal modifications of an input image towards the content described in a target prompt. DDS leverages the rich generative prior of text-to-image diffusion models and can be used as a loss term in an optimization problem to steer an image in the direction dictated by a text prompt. DDS adapts the Score Distillation Sampling (SDS) mechanism to image editing. We show that using SDS alone often produces blurry outputs that lack detail, because its gradients are noisy. To address this issue, DDS uses a prompt that matches the input image to identify and remove the undesired, erroneous directions of SDS. Our key premise is that SDS should be zero when calculated on a matched pair of prompt and image; if the score on such a pair is non-zero, its gradients can be attributed to the erroneous component of SDS. Our analysis demonstrates the effectiveness of DDS for text-based image-to-image translation. We further show that DDS can be used to train an effective zero-shot image translation model. Experimental results indicate that DDS outperforms existing methods in terms of stability and quality, highlighting its potential for real-world applications in text-based image editing.
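The key premise above can be illustrated with a small toy sketch. It is not the authors' implementation: `toy_denoiser`, `add_noise`, the scalar "prompt embedding", and the noise schedule are all hypothetical stand-ins, and the subtraction form of `dds_grad` is one plausible reading of the abstract (the SDS direction on the matched pair is treated as the erroneous component and removed from the SDS direction on the edited image). The same noise and timestep are assumed to be shared by both branches.

```python
import math

def toy_denoiser(noisy, prompt_embedding, t):
    """Hypothetical stand-in for a text-conditioned noise predictor.

    A real system would call a pretrained text-to-image diffusion model here.
    """
    return [0.5 * x - 0.3 * prompt_embedding + 0.1 * t for x in noisy]

def add_noise(image, noise, t):
    """Assumed forward process: blend image and noise with weights set by t."""
    a, s = math.sqrt(1.0 - t), math.sqrt(t)
    return [a * x + s * n for x, n in zip(image, noise)]

def sds_grad(image, prompt_embedding, noise, t):
    """SDS-style direction: predicted noise minus the noise that was added."""
    pred = toy_denoiser(add_noise(image, noise, t), prompt_embedding, t)
    return [p - n for p, n in zip(pred, noise)]

def dds_grad(image, target_prompt, ref_image, ref_prompt, noise, t):
    """Delta of two SDS directions computed with the same noise and timestep.

    The second term is evaluated on the matched (reference image, reference
    prompt) pair and is assumed to capture the erroneous component of SDS.
    """
    g_target = sds_grad(image, target_prompt, noise, t)
    g_ref = sds_grad(ref_image, ref_prompt, noise, t)
    return [a - b for a, b in zip(g_target, g_ref)]

if __name__ == "__main__":
    image = [0.2, -0.1, 0.4]   # toy "image" with three pixels
    noise = [0.3, -0.7, 1.1]   # fixed noise sample, shared by both branches
    t = 0.5

    # Plain SDS on a matched pair is generally non-zero for an imperfect model.
    print("SDS, matched pair:", sds_grad(image, 1.0, noise, t))
    # The delta vanishes when the edited image and target prompt equal the reference.
    print("DDS, matched pair:", dds_grad(image, 1.0, image, 1.0, noise, t))
    # Changing only the target prompt leaves the prompt-dependent direction.
    print("DDS, new prompt:  ", dds_grad(image, 2.0, image, 1.0, noise, t))
```

In this toy setting the plain SDS direction is non-zero even on the matched pair, whereas the delta is exactly zero at the start of editing and depends only on the prompt change afterwards. In an actual optimization the resulting direction would be applied as a gradient step on the image or its latent.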