Diffusion models trained with a mean squared error loss tend to generate unrealistic samples. Current state-of-the-art models rely on classifier-free guidance to improve sample quality, yet why it works so well is not fully understood. In this paper, we show that the effectiveness of classifier-free guidance partly originates from its being a form of implicit perceptual guidance. As a result, we can directly incorporate a perceptual loss in diffusion training to improve sample quality. Since the score matching objective used in diffusion training strongly resembles the denoising autoencoder objective used in unsupervised training of perceptual networks, the diffusion model itself is a perceptual network and can be used to generate a meaningful perceptual loss. We propose a novel self-perceptual objective that results in diffusion models capable of generating more realistic samples. For conditional generation, our method improves sample quality without becoming entangled with the conditional input and therefore does not sacrifice sample diversity. Our method can also improve sample quality for unconditional generation, which was not previously possible with classifier-free guidance.
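For readers unfamiliar with classifier-free guidance, the following is a minimal sketch of how it is commonly applied at sampling time. It is not taken from this paper. It assumes the widely used convention in which a guidance scale of 1 recovers the plain conditional prediction, and the function name `classifier_free_guidance` and the toy arrays are hypothetical, chosen only for illustration.

```python
import numpy as np


def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional prediction toward the conditional one.

    Assumed convention: guidance_scale = 1 returns eps_cond unchanged,
    guidance_scale = 0 returns eps_uncond, and values above 1 push the
    result further along the (eps_cond - eps_uncond) direction.
    """
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    eps_cond = np.asarray(eps_cond, dtype=float)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


# Toy stand-ins for the two noise predictions a model would produce.
eps_u = np.array([0.0, 1.0, 2.0])   # prediction without the condition
eps_c = np.array([1.0, 1.0, 0.0])   # prediction with the condition

guided_plain = classifier_free_guidance(eps_u, eps_c, 1.0)
guided_strong = classifier_free_guidance(eps_u, eps_c, 3.0)
```

The combination requires both a conditional and an unconditional prediction, so it has nothing to extrapolate between when the model is purely unconditional. This is consistent with the abstract's remark that classifier-free guidance could not improve unconditional generation.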