Diffusion models power leading generative AI systems, yet when and how they memorize training data, especially data on low-dimensional manifolds, remains unclear. We find that memorization emerges gradually rather than abruptly: as training data become scarce, diffusion models undergo a smooth collapse in which their capacity to vary along independent directions diminishes. Measuring latent dimensionality via the learned score field, we reveal how generative behavior increasingly concentrates on a few training examples while other modes of variation "freeze out". We propose a geometric theory of memorization in which salient features collapse first and finer details follow, culminating in near point-wise replication. This mirrors physical systems condensing into a few low-energy configurations. Our theoretical predictions agree with experiments on both synthetic and real data, identifying geometric memorization as a distinct phase between generalization and exact copying.
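To make the abstract's "latent dimensionality via the learned score field" concrete, here is a minimal sketch of one plausible such measurement, assuming a trained score network exposed through a hypothetical interface `score_fn(x, sigma)`. At a small noise level, the score's Jacobian strongly contracts off-manifold directions (eigenvalues of order $-1/\sigma^2$) while on-manifold directions are only weakly contracted, so counting the weakly contracted directions estimates the local dimension. This is an illustrative assumption about the spectrum, not the paper's exact procedure.

```python
import torch


def local_latent_dim(score_fn, x, sigma=0.05, rel_tol=0.1):
    """Estimate local latent dimensionality from a learned score field.

    Sketch under the assumption that, near the data manifold at small
    noise level ``sigma``, off-manifold eigenvalues of the score Jacobian
    are of order -1/sigma**2 while on-manifold ones are O(-1).

    score_fn : callable mapping (x, sigma) -> score, both of shape (d,)
    x        : point near the data manifold, shape (d,)
    """
    x = x.detach().requires_grad_(True)
    # Jacobian of the score field at x: shape (d, d).
    jac = torch.autograd.functional.jacobian(lambda z: score_fn(z, sigma), x)
    # Symmetrize: the Jacobian of a true gradient field is symmetric,
    # but a learned network only approximates this.
    sym = 0.5 * (jac + jac.T)
    eigvals = torch.linalg.eigvalsh(sym)
    # Directions contracted nearly as hard as -1/sigma**2 are off-manifold.
    off_manifold = (eigvals < -rel_tol / sigma**2).sum().item()
    return x.numel() - off_manifold


# Toy check: analytic score of data on a 1-D line in R^3 (unit variance
# along the first axis, zero elsewhere), smoothed by Gaussian noise sigma.
toy_score = lambda z, s: -z / torch.tensor([1.0 + s**2, s**2, s**2])
print(local_latent_dim(toy_score, torch.zeros(3)))  # -> 1
```

The threshold is scaled by $1/\sigma^2$ so that the "frozen" off-manifold directions are identified relative to the noise level rather than by a fixed cutoff; as data become scarce, more eigenvalues would drift toward the off-manifold band, which is one way the gradual dimensional collapse described above could be tracked numerically.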