Diffusion models transform noise into data by injecting information that their neural network captured during training. In this paper, we ask: \textit{what} is this information? We find that, in pixel-space diffusion models, (1) a large fraction of the total information in the neural network is committed to reconstructing small-scale perceptual details of the image, and (2) the correlations between images and their class labels are informed by the semantic content of the images and are largely agnostic to the low-level details. We argue that these properties are intrinsically tied to the manifold structure of the data itself. Finally, we show that these facts explain the efficacy of classifier-free guidance: the guidance vector amplifies the mutual information between images and conditioning signals early in the generative process, where it influences semantic structure, but tapers off as perceptual details are filled in.
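For readers unfamiliar with the guidance vector, the following is a minimal sketch of the standard classifier-free guidance combination, in which the guided prediction extrapolates from the unconditional noise estimate toward the conditional one. The function name `cfg_combine`, the array shapes, and the random stand-ins for the network outputs are illustrative assumptions; the notation used in this paper may differ.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Combine unconditional and conditional noise predictions.

    Returns the guided prediction and the guidance vector
    (conditional minus unconditional prediction).
    """
    guidance_vector = eps_cond - eps_uncond
    guided = eps_uncond + guidance_scale * guidance_vector
    return guided, guidance_vector

# Random stand-ins for the two network outputs (assumption: a real
# model would produce these from a noisy image, a timestep, and a label).
rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((3, 8, 8))
eps_cond = eps_uncond + 0.1 * rng.standard_normal((3, 8, 8))

guided, guidance_vector = cfg_combine(eps_uncond, eps_cond, guidance_scale=3.0)

# The norm of the guidance vector is one simple quantity that could be
# tracked across sampling steps to see when guidance is active.
print(float(np.linalg.norm(guidance_vector)))
```

With `guidance_scale=1.0` the combination reduces to the conditional prediction, and with `guidance_scale=0.0` to the unconditional one; values above 1 amplify the direction in which conditioning changes the prediction.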