Diffusion models, though originally designed for generative tasks, have demonstrated impressive self-supervised representation learning capabilities. A particularly intriguing phenomenon in these models is the emergence of unimodal representation dynamics, where the quality of the learned features peaks at an intermediate noise level. In this work, we conduct a comprehensive theoretical and empirical investigation of this phenomenon. Leveraging the inherent low-dimensional structure of image data, we show theoretically that the unimodal dynamics emerge when the diffusion model successfully captures the underlying data distribution. The unimodality arises from an interplay between denoising strength and class confidence across noise scales. Empirically, we further show that, in classification tasks, the presence of unimodal dynamics reliably reflects the generalization of the diffusion model: the dynamics emerge when the model generates novel images, and they gradually flatten into a monotonically decreasing curve as the model begins to memorize the training data.
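The claimed interplay can be illustrated with a minimal numerical sketch. This is not the paper's actual model: it simply assumes, for illustration, that class confidence decays with the noise level while denoising strength grows with it, both modeled by exponentials, so that their product is unimodal with an interior peak.

```python
import numpy as np

# Hypothetical toy model of the interplay described in the abstract.
# The specific functional forms below are assumptions for illustration only.
t = np.linspace(0.0, 3.0, 301)          # noise levels
class_confidence = np.exp(-t)           # decays: noise destroys class information
denoising_strength = 1.0 - np.exp(-t)   # grows: stronger denoising at higher noise
quality = class_confidence * denoising_strength

peak = t[np.argmax(quality)]
print(f"representation quality peaks at t ~ {peak:.2f}")
```

Under these assumed forms the product peaks at t = ln 2, an intermediate noise level rather than either endpoint, mirroring the unimodal curve the abstract describes.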