We theoretically investigate generalization and memorization in diffusion models. Empirical studies suggest that both phenomena are influenced by model complexity and the size of the training dataset. In our experiments, we further observe that the number of noise samples per data sample ($m$) used during Denoising Score Matching (DSM) plays a significant and non-trivial role. We capture these behaviors and shed light on their mechanisms by deriving asymptotically precise expressions for the test and train errors of DSM in a simple theoretical setting. The score function is parameterized by a random features neural network, and the target distribution is a $d$-dimensional Gaussian. We operate in a regime where the dimension $d$, the number of data samples $n$, and the number of features $p$ tend to infinity while the ratios $\psi_n=\frac{n}{d}$ and $\psi_p=\frac{p}{d}$ remain fixed. By characterizing the test and train errors, we identify regimes of generalization and memorization as a function of $\psi_n$, $\psi_p$, and $m$. Our theoretical findings are consistent with the empirical observations.
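The setting described above can be illustrated with a small finite-size simulation. The sketch below is not the paper's derivation or code; it only shows one plausible instance of DSM with a random features score model on Gaussian data. Several choices are assumptions made for illustration and do not come from the abstract: a single fixed diffusion time `t` with Ornstein-Uhlenbeck scaling $a_t = e^{-t}$, $h_t = 1 - e^{-2t}$, a standard Gaussian target $N(0, I_d)$, a ReLU activation, a ridge penalty `lam`, and the function name `dsm_random_features`.

```python
import numpy as np


def dsm_random_features(d=40, psi_n=2.0, psi_p=4.0, m=1, t=0.5, lam=1e-3, seed=0):
    """Fit a random features score model by DSM and return (train_err, test_err).

    Assumptions (illustrative, not taken from the abstract): target N(0, I_d),
    one fixed time t, forward process y = a*x + sqrt(h)*z with a = exp(-t) and
    h = 1 - exp(-2t), ReLU features, ridge regularization lam.
    """
    rng = np.random.default_rng(seed)
    n, p = int(psi_n * d), int(psi_p * d)
    a, h = np.exp(-t), 1.0 - np.exp(-2.0 * t)

    # n data samples, each paired with m independent noise samples.
    X = rng.standard_normal((n, d))
    Z = rng.standard_normal((n, m, d))
    Y = (a * X[:, None, :] + np.sqrt(h) * Z).reshape(n * m, d)
    # DSM regression target: the conditional score -(y - a*x)/h = -z/sqrt(h).
    T = (-Z / np.sqrt(h)).reshape(n * m, d)

    # Fixed random first layer; only the second layer A is trained.
    W = rng.standard_normal((p, d))

    def feats(V):
        return np.maximum(V @ W.T / np.sqrt(d), 0.0) / np.sqrt(p)

    # Closed-form ridge solution of (1/N)||Phi A - T||^2 + lam ||A||^2.
    Phi = feats(Y)
    N = n * m
    A = np.linalg.solve(Phi.T @ Phi / N + lam * np.eye(p), Phi.T @ T / N)

    train_err = np.mean(np.sum((Phi @ A - T) ** 2, axis=1)) / d

    # With a^2 + h = 1 the noisy marginal is N(0, I_d), whose true score is -y.
    Y_test = rng.standard_normal((2000, d))
    test_err = np.mean(np.sum((feats(Y_test) @ A + Y_test) ** 2, axis=1)) / d
    return train_err, test_err


if __name__ == "__main__":
    # Sweep the number of noise samples per data sample at fixed ratios.
    for m in (1, 4, 16):
        tr, te = dsm_random_features(m=m)
        print(f"m={m:2d}  train={tr:.3f}  test={te:.3f}")
```

Varying `psi_n`, `psi_p`, and `m` in this sketch gives a rough empirical view of how train and test errors move with the three quantities the abstract identifies; at such small `d` the numbers are noisy and should not be read as the asymptotic predictions.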