When Models Don't Collapse: On the Consistency of Iterative MLE

The widespread use of generative models has created a feedback loop, in which each generation of models is trained on data partially produced by its predecessors. This process has raised concerns about model collapse: A critical degradation in performance caused by repeated training on synthetic data. However, different analyses in the literature have reached different conclusions as to the severity of model collapse. As such, it remains unclear how concerning this phenomenon is, and under which assumptions it can be avoided. To address this, we theoretically study model collapse for maximum likelihood estimation (MLE), in a natural setting where synthetic data is gradually added to the original data set. Under standard assumptions (similar to those long used for proving asymptotic consistency and normality of MLE), we establish non-asymptotic bounds showing that collapse can be avoided even as the fraction of real data vanishes. On the other hand, we prove that some assumptions (beyond MLE consistency) are indeed necessary: Without them, model collapse can occur arbitrarily quickly, even when the original data is still present in the training set. To the best of our knowledge, these are the first rigorous examples of iterative generative modeling with accumulating data that rapidly leads to model collapse.
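The accumulating-data setting described above can be illustrated with a toy simulation. This is only a minimal sketch, not the paper's construction: the Gaussian model family, the sample sizes, and the names `fit_gaussian_mle` and `iterate_mle` are all assumptions chosen for illustration. Each generation fits the MLE on everything collected so far, then appends synthetic samples drawn from that fit, so the fraction of real data shrinks while the original samples are never discarded.

```python
import random
import statistics


def fit_gaussian_mle(data):
    """Gaussian MLE: sample mean and population (biased) standard deviation."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data, mu)
    return mu, sigma


def iterate_mle(n_real=200, n_synth=200, generations=20, seed=0):
    """Toy iterative MLE with accumulating data (illustrative assumption only).

    Returns the per-generation (mu, sigma) estimates and the final data set size.
    """
    rng = random.Random(seed)
    true_mu, true_sigma = 0.0, 1.0  # hypothetical ground-truth parameters

    # Generation 0 is trained on real data only.
    data = [rng.gauss(true_mu, true_sigma) for _ in range(n_real)]
    history = []
    for _ in range(generations):
        mu, sigma = fit_gaussian_mle(data)
        history.append((mu, sigma))
        # Accumulate: keep all earlier data and add samples from the current fit.
        data.extend(rng.gauss(mu, sigma) for _ in range(n_synth))
    return history, len(data)


if __name__ == "__main__":
    history, total = iterate_mle()
    for generation, (mu, sigma) in enumerate(history):
        print(f"generation {generation:2d}: mu={mu:+.4f} sigma={sigma:.4f}")
    print(f"real-data fraction at the end: {200 / total:.3f}")
```

Running the script prints the estimates per generation, which makes it possible to check, for this one well-behaved model family, whether they stay close to the true parameters as the share of real data decreases.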

Related content

Maximum likelihood estimation

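Maximum likelihood estimation (MLE) is a method for estimating the parameters of a statistical model. The idea was first proposed in 1821 by the German mathematician C. F. Gauss, but the method is usually attributed to the English statistician R. A. Fisher. It rests on the maximum likelihood principle, whose intuition is as follows: if a random experiment has several possible outcomes A, B, C, ..., and outcome A occurs in a single trial, then the experimental conditions can be taken to favor A, that is, the probability P(A) is relatively large.

An example illustrates the principle. Suppose box A contains 99 white balls and 1 black ball, and box B contains 1 white ball and 99 black balls. One box is chosen at random, and one ball is drawn at random from it. The ball is black. A black ball is far more likely to be drawn from box B than from box A, so it is natural to believe that the ball came from box B. In general, the probability of an event A depends on an unknown parameter θ, and different values of θ give different probabilities P(A | θ). When A occurs in a single trial, the value of θ is taken to be the one, among all possible values of θ, that maximizes P(A | θ). Maximum likelihood estimation selects this value as the estimate of θ, so that the observed sample is as likely as possible under the chosen population.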
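The two-box example can be written out as a small calculation. This is a minimal sketch: the box labels, the dictionary `likelihood_black`, and the helpers `mle_box` and `posterior` are hypothetical names introduced for illustration. The posterior assumes each box is equally likely to be picked, which matches "one box is chosen at random" only if that choice is uniform.

```python
from fractions import Fraction

# Likelihood of drawing a black ball from each box, P(black | box).
likelihood_black = {
    "A": Fraction(1, 100),   # box A: 99 white balls, 1 black ball
    "B": Fraction(99, 100),  # box B: 1 white ball, 99 black balls
}


def mle_box(likelihoods):
    """Return the box that maximizes the likelihood of the observed draw."""
    return max(likelihoods, key=likelihoods.get)


def posterior(likelihoods):
    """Posterior over boxes under a uniform prior (an assumption here)."""
    total = sum(likelihoods.values())
    return {box: p / total for box, p in likelihoods.items()}


if __name__ == "__main__":
    print(mle_box(likelihood_black))         # B
    print(posterior(likelihood_black)["B"])  # 99/100
```

Here the "parameter" is simply which box was chosen, so maximizing P(black | box) over the two candidates is the maximum likelihood estimate in its simplest form.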
