Score Distillation Sampling (SDS) is a recent but already widely adopted method that relies on an image diffusion model to control optimization problems using text prompts. In this paper, we conduct an in-depth analysis of the SDS loss function, identify an inherent problem in its formulation, and propose a surprisingly simple but effective fix. Specifically, we decompose the loss into separate factors and isolate the component responsible for noisy gradients. The original formulation uses high text guidance to compensate for this noise, which leads to unwanted side effects. Instead, we train a shallow network that mimics the timestep-dependent denoising deficiency of the image diffusion model so that it can be factored out effectively. We demonstrate the versatility and effectiveness of our novel loss formulation through several qualitative and quantitative experiments, including optimization-based image synthesis and editing, zero-shot image translation network training, and text-to-3D synthesis.
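The abstract does not spell out the decomposition, so the following is only a minimal sketch of one way the idea can be illustrated. It assumes the commonly used SDS residual with classifier-free guidance, `eps_guided - eps`, where `eps_guided = eps_uncond + guidance_scale * (eps_cond - eps_uncond)`. Under that assumption the residual splits algebraically into a text-dependent term and a denoising residual that does not depend on the prompt. All function and variable names are hypothetical, and random arrays stand in for the outputs of a real diffusion model; this is not necessarily the exact factorization used in the paper.

```python
import numpy as np


def sds_residual(eps_uncond, eps_cond, eps, guidance_scale):
    """Residual of the (assumed) SDS gradient with classifier-free guidance."""
    eps_guided = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    return eps_guided - eps


def sds_residual_terms(eps_uncond, eps_cond, eps, guidance_scale):
    """Split the same residual into two additive terms.

    text_term    : scaled difference between conditional and unconditional
                   predictions; the only part that depends on the prompt.
    denoise_term : unconditional prediction minus the injected noise; it does
                   not scale with the guidance weight.
    """
    text_term = guidance_scale * (eps_cond - eps_uncond)
    denoise_term = eps_uncond - eps
    return text_term, denoise_term


# Random stand-ins for model outputs (assumption: no real diffusion model here).
rng = np.random.default_rng(0)
shape = (4, 8, 8)
eps = rng.standard_normal(shape)         # noise added to the image
eps_uncond = rng.standard_normal(shape)  # stand-in for the unconditional prediction
eps_cond = rng.standard_normal(shape)    # stand-in for the text-conditioned prediction

full = sds_residual(eps_uncond, eps_cond, eps, guidance_scale=7.5)
text_term, denoise_term = sds_residual_terms(eps_uncond, eps_cond, eps, guidance_scale=7.5)

# The two terms add up to the full residual.
terms_match = np.allclose(full, text_term + denoise_term)

# Raising the guidance weight rescales only the text term; the denoising
# residual stays the same size, so its relative weight shrinks.
text_hi, denoise_hi = sds_residual_terms(eps_uncond, eps_cond, eps, guidance_scale=100.0)
denoise_unchanged = np.allclose(denoise_term, denoise_hi)
```

In this sketch, a large guidance weight makes the text term dominate the fixed-size denoising residual, which is consistent with the abstract's remark that high text guidance is used to account for the noise.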