Recent empirical studies have shown that diffusion models can effectively learn image distributions and generate new samples. Remarkably, they can do so from a small number of training samples even when the image dimension is large, thereby circumventing the curse of dimensionality. In this work, we provide theoretical insight into this phenomenon by leveraging three key empirical observations: (i) the low intrinsic dimensionality of image data, (ii) the union-of-manifolds structure of image data, and (iii) the low-rank property of the denoising autoencoder in trained diffusion models. These observations motivate us to model the underlying distribution of image data as a mixture of low-rank Gaussians and to parameterize the denoising autoencoder as a low-rank model according to the score function of the assumed distribution. Under this setup, we rigorously show that optimizing the training loss of diffusion models is equivalent to solving the canonical subspace clustering problem over the training samples. Based on this equivalence, we further show that the minimal number of samples required to learn the underlying distribution scales linearly with the intrinsic dimensions under the above data and model assumptions. This insight sheds light on why diffusion models can break the curse of dimensionality and exhibit a phase transition in learning distributions. Moreover, we empirically establish a correspondence between the subspaces and the semantic representations of image data, which facilitates image editing. We validate these results with experiments on both simulated distributions and image datasets.
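To make the setup concrete, the following is a minimal sketch of a mixture of low-rank Gaussians and the posterior-mean denoiser it induces. It is an illustration under stated assumptions, not the authors' implementation: the components are zero-mean with covariance U_k U_k^T, share the same dimension d, and have equal mixing weights; the noisy sample is x_t = s * x_0 + gamma * eps with fixed scalars s and gamma; and the names `random_basis`, `sample_molrg`, and `molrg_denoiser` are hypothetical. Under these assumptions the denoiser is a softmax-weighted sum of projections onto the subspaces, which is why it can be read as a soft subspace assignment.

```python
import numpy as np

def random_basis(n, d, rng):
    # Orthonormal basis (n x d) of a random d-dimensional subspace of R^n.
    q, _ = np.linalg.qr(rng.standard_normal((n, d)))
    return q

def sample_molrg(bases, num, rng):
    # Draw samples from an equal-weight mixture of N(0, U_k U_k^T).
    # Assumption: all bases have the same number of columns d.
    d = bases[0].shape[1]
    labels = rng.integers(len(bases), size=num)
    coeffs = rng.standard_normal((num, d))
    x = np.stack([bases[k] @ a for k, a in zip(labels, coeffs)])
    return x, labels

def molrg_denoiser(x_t, bases, s, gamma):
    # Posterior mean E[x_0 | x_t] for x_t = s * x_0 + gamma * eps,
    # derived for the equal-weight, equal-dimension mixture above.
    phi = s**2 / (2 * gamma**2 * (s**2 + gamma**2))
    # Squared norm of the projection of each sample onto each subspace.
    proj_sq = np.stack([np.sum((x_t @ U) ** 2, axis=1) for U in bases], axis=1)
    logits = phi * proj_sq
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)            # soft subspace assignment
    proj = np.stack([(x_t @ U) @ U.T for U in bases], axis=1)  # (num, K, n)
    shrink = s / (s**2 + gamma**2)
    return shrink * np.einsum("nk,nkd->nd", w, proj), w

# Usage: ambient dimension n is much larger than intrinsic dimension d.
rng = np.random.default_rng(0)
n, d, K = 50, 3, 2
bases = [random_basis(n, d, rng) for _ in range(K)]
x0, labels = sample_molrg(bases, 200, rng)
s, gamma = 1.0, 0.1
x_t = s * x0 + gamma * rng.standard_normal(x0.shape)
x_hat, w = molrg_denoiser(x_t, bases, s, gamma)

assigned = w.argmax(axis=1)
accuracy = np.mean(assigned == labels)
mse_noisy = np.mean(np.sum((x_t - x0) ** 2, axis=1))
mse_denoised = np.mean(np.sum((x_hat - x0) ** 2, axis=1))
print(accuracy, mse_noisy, mse_denoised)
```

In this sketch the bases are given rather than learned. Estimating them from the training samples is the subspace clustering problem that the abstract relates to minimizing the diffusion training loss.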