Diffusion models are powerful generative models that produce high-quality samples from complex data. While their infinite-data behavior is well understood, their generalization with finite data remains less clear. Classical learning theory predicts that generalization requires a sample complexity that is exponential in the dimension, far exceeding practical needs. We address this gap by analyzing diffusion models through the lens of data covariance spectra, which often follow power-law decays reflecting the hierarchical structure of real data. To understand whether such hierarchical structure can benefit learning in diffusion models, we develop a theoretical framework based on linear neural networks, congruent with a Gaussian hypothesis on the data. We quantify how the hierarchical organization of variance in the data and regularization impact generalization. We find two regimes: when the number of training samples $N$ is smaller than the data dimension $d$, not all directions of variation are present in the training data, which results in a large gap between training and test loss. In this regime, we demonstrate how a strongly hierarchical data structure, regularization, and early stopping help to prevent overfitting. For $N > d$, we find that the sampling distributions of linear diffusion models approach their optimum (measured by the Kullback-Leibler divergence) linearly with $d/N$, independent of the specifics of the data distribution. Our work clarifies how sample complexity governs generalization in a simple model of diffusion-based generative models.
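As a minimal numerical sketch (not taken from the paper), the $N > d$ regime under the Gaussian hypothesis can be illustrated as follows: draw $N$ samples from a $d$-dimensional Gaussian whose covariance has a power-law eigenvalue spectrum, fit the empirical covariance (the best a linear diffusion model can hope to learn), and measure the Kullback-Leibler divergence to the true distribution. The function names, the exponent `alpha`, and the specific sizes below are illustrative assumptions, not quantities from the source.

```python
import numpy as np

def power_law_covariance(d, alpha=1.5, rng=None):
    """Covariance with power-law eigenvalue decay lambda_k ~ k^{-alpha} (illustrative choice)."""
    rng = np.random.default_rng(rng)
    eigvals = np.arange(1, d + 1, dtype=float) ** (-alpha)
    # Random orthogonal eigenbasis via QR decomposition of a Gaussian matrix.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q @ np.diag(eigvals) @ Q.T

def gaussian_kl(cov_model, cov_true):
    """KL( N(0, cov_model) || N(0, cov_true) ) in nats, for zero-mean Gaussians."""
    d = cov_true.shape[0]
    inv_true = np.linalg.inv(cov_true)
    _, logdet_true = np.linalg.slogdet(cov_true)
    _, logdet_model = np.linalg.slogdet(cov_model)
    return 0.5 * (np.trace(inv_true @ cov_model) - d + logdet_true - logdet_model)

rng = np.random.default_rng(0)
d = 32
cov_true = power_law_covariance(d, rng=rng)

# For N > d the empirical covariance is invertible and the KL divergence is finite;
# it should shrink roughly in proportion to d/N as N grows.
for N in [64, 128, 256, 512, 1024]:
    X = rng.multivariate_normal(np.zeros(d), cov_true, size=N)
    cov_emp = X.T @ X / N
    kl = gaussian_kl(cov_emp, cov_true)
    print(f"N={N:5d}  d/N={d/N:.3f}  KL={kl:.3f}  KL/d={kl/d:.4f}")
```

In this toy setting the expected KL divergence of the empirical Gaussian behaves like $\mathcal{O}(d^2/N)$, i.e. linearly in $d/N$ per dimension, which is the kind of scaling the abstract attributes to linear diffusion models for $N > d$.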