Classifier guidance -- using the gradients of an image classifier to steer the generations of a diffusion model -- has the potential to dramatically expand creative control over image generation and editing. However, classifier guidance currently requires either training new noise-aware models to obtain accurate gradients or using a one-step denoising approximation of the final generation, which leads to misaligned gradients and sub-optimal control. We highlight this approximation's shortcomings and propose a novel guidance method: Direct Optimization of Diffusion Latents (DOODL), which enables plug-and-play guidance by optimizing diffusion latents w.r.t. the gradients of a pre-trained classifier on the true generated pixels, using an invertible diffusion process to achieve memory-efficient backpropagation. Showcasing the potential of more precise guidance, DOODL outperforms one-step classifier guidance on computational and human evaluation metrics across different forms of guidance: using CLIP guidance to improve generations of complex prompts from DrawBench, using fine-grained visual classifiers to expand the vocabulary of Stable Diffusion, enabling image-conditioned generation with a CLIP visual encoder, and improving image aesthetics using an aesthetic scoring network.
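The core idea, optimizing the initial latent against a loss evaluated on the final output while using invertibility to avoid storing intermediate states, can be illustrated with a toy sketch. This is not the DOODL implementation: the two-variable additive coupling step stands in for an invertible diffusion sampler, the quadratic loss stands in for a pre-trained classifier, and every name and constant (`A`, `B`, `STEPS`, `step`, `grad_wrt_latent`, `optimize_latent`) is hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy constants; not taken from the paper.
A, B, STEPS = 0.1, 0.1, 10


def step(x, y):
    """One invertible 'sampling' step (additive coupling, a stand-in)."""
    x = x + A * math.tanh(y)
    y = y + B * math.tanh(x)
    return x, y


def inverse_step(x, y):
    """Exact inverse of step(): undo the two updates in reverse order."""
    y = y - B * math.tanh(x)
    x = x - A * math.tanh(y)
    return x, y


def generate(x, y):
    """Run the full deterministic chain from the initial latent."""
    for _ in range(STEPS):
        x, y = step(x, y)
    return x, y


def loss(x, y, target):
    """Stand-in for a classifier loss on the final generated output."""
    return 0.5 * ((x - target) ** 2 + (y - target) ** 2)


def grad_wrt_latent(x0, y0, target):
    """Gradient of the final loss w.r.t. the initial latent.

    Intermediate states are recomputed by inversion during the backward
    pass instead of being stored, so memory does not grow with STEPS.
    """
    x, y = generate(x0, y0)
    gx, gy = x - target, y - target  # dL/dx_T, dL/dy_T
    for _ in range(STEPS):
        # Backward through y_new = y_old + B * tanh(x_new).
        gx += gy * B * (1.0 - math.tanh(x) ** 2)
        y = y - B * math.tanh(x)  # recover y_old
        # Backward through x_new = x_old + A * tanh(y_old).
        gy += gx * A * (1.0 - math.tanh(y) ** 2)
        x = x - A * math.tanh(y)  # recover x_old
    return gx, gy


def optimize_latent(x0, y0, target, lr=0.1, iters=200):
    """Plain gradient descent on the initial latent."""
    for _ in range(iters):
        gx, gy = grad_wrt_latent(x0, y0, target)
        x0 -= lr * gx
        y0 -= lr * gy
    return x0, y0


if __name__ == "__main__":
    start = (0.5, -0.5)
    tuned = optimize_latent(*start, target=1.0)
    print("loss before:", loss(*generate(*start), 1.0))
    print("loss after: ", loss(*generate(*tuned), 1.0))
```

The gradient here is taken with respect to the true end-of-chain output rather than a one-step estimate of it, which is the distinction the abstract draws. In a real setting the coupling step would be replaced by an invertible diffusion sampler and the quadratic loss by a classifier, CLIP, or aesthetic scoring model.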