The widespread use of diffusion models has produced an abundance of AI-generated data, raising concerns about model collapse, a phenomenon in which recursive training on synthetic data degrades performance. Prior work primarily characterizes this collapse through variance shrinkage or distribution shift, but these perspectives miss its practical manifestations. This paper identifies a transition from generalization to memorization during model collapse in diffusion models: as training iterates on synthetic samples, models increasingly replicate training data instead of generating novel content. This transition is driven directly by the declining entropy of the synthetic training data produced in each training cycle, which serves as a clear indicator of model degradation. Motivated by this insight, we propose an entropy-based data selection strategy that mitigates the transition from generalization to memorization and alleviates model collapse. Empirical results show that our approach significantly improves visual quality and diversity under recursive generation, effectively preventing collapse.
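To make the idea of entropy-based data selection concrete, the sketch below filters a pool of synthetic samples before the next retraining cycle so that the retained subset keeps the entropy of the training pool high. The abstract does not specify the paper's estimator or selection rule, so this is only a minimal illustration under assumptions: samples are represented as feature vectors (e.g., from some pretrained encoder), per-sample entropy contributions are approximated with a k-nearest-neighbor (Kozachenko-Leonenko-style) log-distance proxy, and the function names `knn_entropy_proxy` and `select_high_entropy_subset` are hypothetical.

```python
import numpy as np

def knn_entropy_proxy(features, k=5):
    """Per-sample entropy proxy: log distance to the k-th nearest
    neighbor. Samples in sparse regions of feature space score higher,
    so keeping them preserves diversity in the training pool."""
    # Pairwise Euclidean distances (O(n^2) memory; fine for a sketch).
    sq = np.sum(features ** 2, axis=1)
    dist = np.sqrt(np.maximum(
        sq[:, None] + sq[None, :] - 2.0 * features @ features.T, 0.0))
    np.fill_diagonal(dist, np.inf)  # exclude self-distance
    knn_dist = np.sort(dist, axis=1)[:, k - 1]
    return np.log(knn_dist + 1e-12)

def select_high_entropy_subset(features, keep_ratio=0.5, k=5):
    """Return indices of the samples whose k-NN log-distances are
    largest, i.e., the subset whose retention best preserves the
    entropy proxy of the synthetic pool."""
    scores = knn_entropy_proxy(features, k=k)
    n_keep = max(1, int(len(features) * keep_ratio))
    return np.argsort(scores)[-n_keep:]

# Toy usage: 1000 synthetic samples embedded in a 64-d feature space.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 64))
kept = select_high_entropy_subset(feats, keep_ratio=0.5)
print(f"kept {len(kept)} of {len(feats)} synthetic samples")
```

In a recursive-training loop, one would generate samples, embed them, apply such a filter, and retrain on the retained subset; the intended effect, per the abstract's argument, is to slow the entropy decline that drives the generalization-to-memorization transition.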