In this work, we investigate an intriguing and prevalent phenomenon of diffusion models, which we term "consistent model reproducibility": given the same starting noise input and a deterministic sampler, different diffusion models often yield remarkably similar outputs. We confirm this phenomenon through comprehensive experiments, which suggest that different diffusion models consistently reach the same data distribution and score function regardless of diffusion model framework, model architecture, or training procedure. More strikingly, our further investigation implies that diffusion models learn distinct distributions depending on the training data size. This is supported by the fact that model reproducibility manifests in two distinct training regimes: (i) a "memorization regime", where the diffusion model overfits to the training data distribution, and (ii) a "generalization regime", where the model learns the underlying data distribution. Our study also finds that this valuable property generalizes to many variants of diffusion models, including those for conditional generation, solving inverse problems, and model fine-tuning. Finally, our work raises numerous intriguing theoretical questions for future investigation and highlights practical implications for training efficiency, model privacy, and controlled generation with diffusion models.
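As a rough illustration of what "same starting noise, deterministic sampler, similar outputs" means, the following toy sketch may help. It is not the experimental setup of this work: the two "models" are closed-form Gaussian score functions fitted to two independent sample sets (an assumption made so the example runs without training), the noise schedule $\sigma(t) = t$ and the Euler probability-flow ODE sampler are assumed choices, and all function names are hypothetical.

```python
import numpy as np


def fit_gaussian_score(data):
    """Toy 'model': score of a diagonal Gaussian fitted to `data`, noised to level t.

    Assumes a variance-exploding schedule sigma(t) = t, so the noised
    marginal is N(mu, var + t^2) and its score is available in closed form.
    """
    mu = data.mean(axis=0)
    var = data.var(axis=0)

    def score(x, t):
        return -(x - mu) / (var + t ** 2)

    return score


def sample_ode(score, x_T, t_max=80.0, t_min=1e-3, steps=200):
    """Deterministic sampler: Euler steps on the probability-flow ODE dx/dt = -t * score(x, t)."""
    ts = np.geomspace(t_max, t_min, steps)
    x = x_T.copy()
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        drift = -t_cur * score(x, t_cur)
        x = x + (t_next - t_cur) * drift
    return x


def demo(seed=0, n_train=5000, n_samples=256, t_max=80.0):
    rng = np.random.default_rng(seed)
    true_mu = np.array([1.0, -2.0])
    true_std = np.array([2.0, 0.5])

    # Two "models" fitted to independent draws from the same distribution.
    score_a = fit_gaussian_score(rng.normal(true_mu, true_std, size=(n_train, 2)))
    score_b = fit_gaussian_score(rng.normal(true_mu, true_std, size=(n_train, 2)))

    # Shared starting noise versus an unrelated starting noise.
    noise_shared = t_max * rng.standard_normal((n_samples, 2))
    noise_other = t_max * rng.standard_normal((n_samples, 2))

    out_a = sample_ode(score_a, noise_shared, t_max=t_max)
    out_b_same = sample_ode(score_b, noise_shared, t_max=t_max)
    out_b_other = sample_ode(score_b, noise_other, t_max=t_max)

    # Mean Euclidean distance between paired outputs.
    gap_same = np.linalg.norm(out_a - out_b_same, axis=1).mean()
    gap_other = np.linalg.norm(out_a - out_b_other, axis=1).mean()
    return gap_same, gap_other


if __name__ == "__main__":
    gap_same, gap_other = demo()
    print(f"same noise:      mean distance = {gap_same:.4f}")
    print(f"different noise: mean distance = {gap_other:.4f}")
```

In this toy setting, the distance between the two models' outputs should be much smaller when they start from the same noise than when they start from different noise. A comparison between real trained diffusion models would replace the closed-form scores with learned networks and would likely use a perceptual or feature-space similarity measure rather than raw Euclidean distance.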