The primary axes of interest in image-generating diffusion models are image quality, the amount of variation in the results, and how well the results align with a given condition, e.g., a class label or a text prompt. The popular classifier-free guidance approach uses an unconditional model to guide a conditional model, leading to simultaneously better prompt alignment and higher-quality images at the cost of reduced variation. These effects seem inherently entangled, and thus hard to control. We make the surprising observation that it is possible to obtain disentangled control over image quality without compromising the amount of variation by guiding generation using a smaller, less-trained version of the model itself rather than an unconditional model. This leads to significant improvements in ImageNet generation, setting record FIDs of 1.01 for 64×64 and 1.25 for 512×512, using publicly available networks. Furthermore, the method is also applicable to unconditional diffusion models, drastically improving their quality.
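The guidance mechanism described above can be sketched as a simple extrapolation of the main model's denoised estimate away from that of a guiding model; in classifier-free guidance the guide is the unconditional model, whereas here it is a smaller, less-trained copy of the main model. The function and toy values below are an illustrative sketch, not the paper's implementation:

```python
def guided_denoise(d_main, d_guide, w):
    """Extrapolate the main model's prediction past the guiding model's.

    d_main:  denoised estimate from the main (conditional) model
    d_guide: denoised estimate from the guiding model (an unconditional
             model in classifier-free guidance; a weaker version of the
             main model in the approach described above)
    w:       guidance weight; w = 1 disables guidance entirely
    """
    return [g + w * (m - g) for m, g in zip(d_main, d_guide)]

# Toy per-pixel estimates (hypothetical values, for illustration only).
d_main = [0.2, -0.5, 1.0]
d_guide = [0.1, -0.4, 0.8]

print(guided_denoise(d_main, d_guide, 1.0))  # w = 1: main model unchanged
print(guided_denoise(d_main, d_guide, 2.0))  # w > 1: pushed away from the guide
```

With `w > 1`, the output is pushed away from whatever the weaker guide predicts, amplifying exactly the aspects the stronger model handles better; this is the lever that the abstract claims improves image quality without collapsing variation.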