With the remarkable advent of text-to-image diffusion models, image editing methods have become more diverse and continue to evolve. A promising recent approach in this realm is Delta Denoising Score (DDS), an image editing technique based on the Score Distillation Sampling (SDS) framework that leverages the rich generative prior of text-to-image diffusion models. However, relying solely on the difference between scoring functions is insufficient for preserving specific structural elements of the original image, a crucial aspect of image editing. Inspired by the similarities and important differences between DDS and contrastive learning for unpaired image-to-image translation (CUT), we present an embarrassingly simple yet very powerful modification of DDS, called Contrastive Denoising Score (CDS), for latent diffusion models (LDM). Specifically, to enforce structural correspondence between the input and output while maintaining the controllability of content, we regulate structural consistency using the CUT loss within the DDS framework. To calculate this loss, instead of employing auxiliary networks, we use the intermediate features of the LDM, in particular those from the self-attention layers, which possess rich spatial information. Our approach enables zero-shot image-to-image translation and neural radiance field (NeRF) editing, achieving a well-balanced interplay between maintaining structural details and transforming content. Qualitative results and comparisons demonstrate the effectiveness of our proposed method. The project page with code is available at https://hyelinnam.github.io/CDS/.
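To make the structural term concrete, the following is a minimal sketch of a patch-wise contrastive (InfoNCE) loss of the kind used in CUT. It is not the authors' implementation. The function name `patch_nce_loss`, the temperature default `tau=0.07`, and the toy feature vectors are assumptions for illustration. In CDS the features would come from the LDM self-attention layers and the loss would be combined with the DDS update; here plain Python lists stand in for those tensors.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def patch_nce_loss(src_feats, out_feats, tau=0.07):
    """InfoNCE loss over patch locations (hypothetical helper).

    src_feats[i] and out_feats[i] are feature vectors taken at the same
    spatial location of the source and edited image. For each output
    patch, the source patch at the same location is the positive and the
    source patches at all other locations are the negatives.
    """
    if not src_feats or len(src_feats) != len(out_feats):
        raise ValueError("need two non-empty feature lists of equal length")

    keys = [_normalize(v) for v in src_feats]
    queries = [_normalize(v) for v in out_feats]

    total = 0.0
    for i, query in enumerate(queries):
        # Cosine similarity to every source patch, scaled by temperature.
        logits = [sum(a * b for a, b in zip(query, key)) / tau for key in keys]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the same-location patch as the target class.
        total += log_denom - logits[i]
    return total / len(queries)


# Toy features: three patch locations with three channels each.
source = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = [list(v) for v in source]           # structure preserved
shuffled = [source[1], source[2], source[0]]  # structure scrambled

loss_aligned = patch_nce_loss(source, aligned)
loss_shuffled = patch_nce_loss(source, shuffled)
```

With these toy inputs the loss is small when each output patch matches the source patch at the same location and large when the patches are permuted, which is the behavior a structure-preserving regularizer needs.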