Diffusion models trained with mean squared error loss tend to generate unrealistic samples. Current state-of-the-art models rely on classifier-free guidance to improve sample quality, yet its surprising effectiveness is not fully understood. In this paper, we show that the effectiveness of classifier-free guidance stems in part from its acting as a form of implicit perceptual guidance. As a result, we can directly incorporate a perceptual loss into diffusion training to improve sample quality. Since the score matching objective used in diffusion training strongly resembles the denoising autoencoder objective used in unsupervised training of perceptual networks, the diffusion model itself is a perceptual network and can be used to generate a meaningful perceptual loss. We propose a novel self-perceptual objective that results in diffusion models capable of generating more realistic samples. For conditional generation, our method improves sample quality alone, without becoming entangled with the conditional input, and therefore does not sacrifice sample diversity. Our method can also improve sample quality for unconditional generation, which was previously not possible with classifier-free guidance.
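To make the contrast between the two objectives concrete, the following is a minimal sketch of how a self-perceptual loss could differ from a plain mean squared error loss. It is an illustration under assumptions the abstract does not state: the toy two-layer denoiser, the use of a frozen copy of the model as the perceptual network, the choice of hidden activations as features, the re-noising step, and every function and parameter name here (`make_toy_denoiser`, `hidden_features`, `self_perceptual_loss`, `alpha`, `alpha2`) are hypothetical and not taken from the paper.

```python
import math
import random

random.seed(0)
DIM, HIDDEN = 8, 16  # toy sizes, chosen arbitrarily for illustration


def random_matrix(rows, cols, scale=0.1):
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(vec, mat):
    # vec has length rows, mat is rows x cols; returns a vector of length cols.
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def make_toy_denoiser():
    # Hypothetical stand-in for a diffusion network (assumption, not the paper's model).
    return {"w1": random_matrix(DIM, HIDDEN), "w2": random_matrix(HIDDEN, DIM)}


def hidden_features(params, x_noisy):
    # Intermediate activations, treated here as the "perceptual" features.
    return [math.tanh(h) for h in matvec(x_noisy, params["w1"])]


def predict_clean(params, x_noisy):
    # Toy prediction of the clean sample from a noisy input.
    return matvec(hidden_features(params, x_noisy), params["w2"])


def add_noise(x0, noise, alpha):
    # Simple variance-preserving mix; alpha in (0, 1) is the signal fraction.
    a, b = math.sqrt(alpha), math.sqrt(1.0 - alpha)
    return [a * x + b * n for x, n in zip(x0, noise)]


def mean_sq(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def mse_loss(online, x0, noise, alpha):
    # Baseline: compare the prediction with the target directly in sample space.
    return mean_sq(predict_clean(online, add_noise(x0, noise, alpha)), x0)


def self_perceptual_loss(online, frozen, x0, noise, alpha, noise2, alpha2):
    # Assumed variant: compare prediction and target in the feature space of a
    # frozen copy of the denoiser, after re-noising both with the same noise.
    x0_hat = predict_clean(online, add_noise(x0, noise, alpha))
    feat_real = hidden_features(frozen, add_noise(x0, noise2, alpha2))
    feat_pred = hidden_features(frozen, add_noise(x0_hat, noise2, alpha2))
    return mean_sq(feat_pred, feat_real)


frozen = make_toy_denoiser()
online = {k: [row[:] for row in v] for k, v in frozen.items()}  # trainable copy
x0 = [random.gauss(0.0, 1.0) for _ in range(DIM)]
noise = [random.gauss(0.0, 1.0) for _ in range(DIM)]
noise2 = [random.gauss(0.0, 1.0) for _ in range(DIM)]

pixel = mse_loss(online, x0, noise, 0.5)
perceptual = self_perceptual_loss(online, frozen, x0, noise, 0.5, noise2, 0.7)
print("sample-space loss:", pixel)
print("feature-space loss:", perceptual)
```

In this sketch only `online` would receive gradient updates in a real training loop, while `frozen` stays fixed and serves purely as a feature extractor; no optimizer is included, so the code only evaluates the two losses once.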