We theoretically investigate generalization and memorization in diffusion models. Empirical studies suggest that these phenomena are influenced by model complexity and training dataset size. In our experiments, we further observe that the number of noise samples per data sample, $m$, used during Denoising Score Matching (DSM) plays a significant and non-trivial role. We capture these behaviors and shed light on their mechanisms by deriving asymptotically precise expressions for the test and train errors of DSM in a simple theoretical setting: the score function is parameterized by a random-features neural network, and the target distribution is a $d$-dimensional Gaussian. We work in the proportional regime where the dimension $d$, the number of data samples $n$, and the number of features $p$ tend to infinity with the ratios $\psi_n=\frac{n}{d}$ and $\psi_p=\frac{p}{d}$ held fixed. By characterizing the test and train errors, we identify regimes of generalization and memorization as functions of $\psi_n$, $\psi_p$, and $m$. Our theoretical findings are consistent with the empirical observations.
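The setting described above can be sketched numerically. The following is a minimal, hypothetical illustration (not the paper's actual experimental code): Gaussian data in dimension $d$, a random-features model whose first layer is fixed and whose second layer is fit by ridge regression, and a single-noise-level DSM objective built from $m$ noise draws per data sample. All variable names and the choice of ReLU features, ridge penalty, and noise level $\sigma$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n, p, m = 20, 40, 60, 5    # dimension, data samples, random features, noise samples per datum
sigma = 1.0                    # single DSM noise level (illustrative choice)

X = rng.standard_normal((n, d))                 # n samples from a d-dimensional Gaussian
F = rng.standard_normal((p, d)) / np.sqrt(d)    # fixed random first-layer weights

def features(x):
    """Random-feature map phi(x) = relu(F x); the score model is s(x) = phi(x)^T A."""
    return np.maximum(F @ x, 0.0)

# DSM as a regression problem: for each data sample x_i, draw m noise vectors eps,
# and regress the features of the noisy point x_i + sigma*eps onto the target -eps/sigma.
Phi, Y = [], []
for i in range(n):
    for _ in range(m):
        eps = rng.standard_normal(d)
        Phi.append(features(X[i] + sigma * eps))
        Y.append(-eps / sigma)
Phi, Y = np.array(Phi), np.array(Y)             # shapes (n*m, p) and (n*m, d)

# Ridge-regularized least squares for the trainable second-layer weights A (p x d).
lam = 1e-3
A = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ Y)

# Empirical DSM train error of the fitted score model.
train_err = np.mean(np.sum((Phi @ A - Y) ** 2, axis=1))
print(train_err)
```

In this toy sketch, the ratios $\psi_n = n/d$ and $\psi_p = p/d$ can be varied (here $\psi_n = 2$, $\psi_p = 3$) to probe how the fitted score, and hence the train/test gap, changes with the relative sizes of data, features, and the per-sample noise count $m$.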