Diffusion models trained with a mean squared error loss tend to generate unrealistic samples. Current state-of-the-art models rely on classifier-free guidance to improve sample quality, yet its surprising effectiveness is not fully understood. In this paper, we show that the effectiveness of classifier-free guidance partly originates from its being a form of implicit perceptual guidance. As a result, a perceptual loss can be incorporated directly into diffusion training to improve sample quality. Because the score matching objective used in diffusion training strongly resembles the denoising autoencoder objective used in unsupervised training of perceptual networks, the diffusion model itself is a perceptual network and can be used to generate a meaningful perceptual loss. We propose a novel self-perceptual objective that results in diffusion models capable of generating more realistic samples. For conditional generation, our method improves sample quality without becoming entangled with the conditional input, and therefore does not sacrifice sample diversity. Our method can also improve sample quality for unconditional generation, which was not previously possible with classifier-free guidance.
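To make the two ideas contrasted in the abstract concrete, the following is a minimal sketch, not the paper's implementation. `cfg_combine` uses the standard classifier-free guidance combination of an unconditional and a conditional prediction. `perceptual_loss` shows only the general shape of a perceptual loss, namely a mean squared error computed on features rather than on raw outputs. The `features` callable and all function names are hypothetical; in the self-perceptual setting described above, the features would come from the diffusion network itself, and the abstract does not specify those details.

```python
from typing import Callable, List

Vector = List[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cfg_combine(uncond: Vector, cond: Vector, scale: float) -> Vector:
    """Standard classifier-free guidance combination.

    Extrapolates from the unconditional prediction toward the
    conditional one; scale == 1.0 returns the conditional prediction.
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def perceptual_loss(
    features: Callable[[Vector], Vector],  # hypothetical feature extractor
    prediction: Vector,
    target: Vector,
) -> float:
    """MSE measured in feature space instead of output space."""
    return mse(features(prediction), features(target))


if __name__ == "__main__":
    # Guidance scale 2.0 pushes the result past the conditional prediction.
    print(cfg_combine([0.0, 1.0], [1.0, 3.0], 2.0))

    # Toy feature extractor (an assumption for illustration): the sum.
    toy_features = lambda v: [sum(v)]
    print(mse([1.0, 2.0], [2.0, 1.0]))
    print(perceptual_loss(toy_features, [1.0, 2.0], [2.0, 1.0]))
```

In the toy usage, the two vectors differ element-wise but share the same toy feature, so the output-space error is nonzero while the feature-space error is zero. This illustrates only that the two losses can disagree; it says nothing about which features a real perceptual network learns.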