In this work, we investigate an intriguing and prevalent phenomenon of diffusion models which we term "consistent model reproducibility": given the same starting noise input and a deterministic sampler, different diffusion models often yield remarkably similar outputs. We confirm this phenomenon through comprehensive experiments, implying that different diffusion models consistently converge to the same data distribution and score function regardless of diffusion model frameworks, model architectures, or training procedures. More strikingly, our further investigation implies that diffusion models learn distinct distributions depending on the training data size. This is supported by the fact that model reproducibility manifests in two distinct training regimes: (i) the "memorization regime," where the diffusion model overfits to the training data distribution, and (ii) the "generalization regime," where the model learns the underlying data distribution. Our study also finds that this valuable property generalizes to many variants of diffusion models, including those for conditional generation, solving inverse problems, and model fine-tuning. Finally, our work raises numerous intriguing theoretical questions for future investigation and highlights practical implications regarding training efficiency, model privacy, and the controlled generation of diffusion models.
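The reproducibility check described above can be illustrated with a minimal 1-D sketch. Everything below is a hypothetical toy setup, not the paper's actual models or experiments: two "independently trained" models are simulated as slightly perturbed analytic score estimates for Gaussian data, and the deterministic sampler is a simple Euler discretization of the probability-flow ODE. Started from the same noise, the two models produce nearly identical samples.

```python
import numpy as np

def score(x, t, mu_hat, s0=1.0):
    # Score of p_t = N(mu_hat, s0^2 + t^2) under a VE-style diffusion
    # applied to 1-D Gaussian data N(mu_hat, s0^2). mu_hat stands in for
    # what a trained model has learned (hypothetical).
    return -(x - mu_hat) / (s0**2 + t**2)

def sample_pf_ode(x_T, mu_hat, T=10.0, steps=1000, s0=1.0):
    # Deterministic sampler: Euler integration of the probability-flow ODE
    # dx/dt = -t * score(x, t) from t = T down to t = 0.
    x, dt = x_T, T / steps
    for i in range(steps):
        t = T - i * dt
        x = x + dt * t * score(x, t, mu_hat, s0)
    return x

# Shared starting noise for both "models" (true data mean taken as 0 here).
rng = np.random.default_rng(0)
x_T = 10.0 * rng.standard_normal()

# Two "independently trained" models, simulated as slightly different
# score estimates (the +/- 0.02 perturbation mimics training variation).
out_a = sample_pf_ode(x_T, mu_hat=+0.02)
out_b = sample_pf_ode(x_T, mu_hat=-0.02)

# Despite distinct scores, the same noise maps to nearly the same sample.
print(abs(out_a - out_b))
```

In this toy setting the closeness can be verified analytically: the probability-flow ODE contracts the initial noise toward the learned mean, so the output gap shrinks to a small fraction of the gap between the two score estimates.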