The widespread use of generative models has created a feedback loop in which each generation of models is trained on data partially produced by its predecessors. This process has raised concerns about model collapse: a critical degradation in performance caused by repeated training on synthetic data. However, analyses in the literature have reached different conclusions about the severity of model collapse. As such, it remains unclear how concerning this phenomenon is and under which assumptions it can be avoided. To address this, we theoretically study model collapse for maximum likelihood estimation (MLE) in a natural setting where synthetic data is gradually added to the original data set. Under standard assumptions (similar to those long used to prove asymptotic consistency and normality of MLE), we establish non-asymptotic bounds showing that collapse can be avoided even as the fraction of real data vanishes. On the other hand, we prove that some assumptions (beyond MLE consistency) are indeed necessary: without them, model collapse can occur arbitrarily quickly, even when the original data is still present in the training set. To the best of our knowledge, these are the first rigorous examples of iterative generative modeling with accumulating data that rapidly leads to model collapse.
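To make the accumulating-data setting concrete, the following is a minimal simulation sketch, not the analysis from the paper. It assumes a one-dimensional Gaussian family (an illustrative choice; the abstract does not name a model class), and the names `fit_gaussian_mle` and `simulate_accumulation` and the parameters `n_per_round`, `generations`, and `seed` are hypothetical. Each round fits the MLE on all data collected so far, samples synthetic data from the fitted model, and appends it to the training set, so the real data is never discarded while its fraction shrinks.

```python
import random
import statistics


def fit_gaussian_mle(data):
    """Return the Gaussian MLE (mean, std); the variance divides by n, not n - 1."""
    mu = statistics.fmean(data)
    sigma = statistics.pvariance(data, mu) ** 0.5
    return mu, sigma


def simulate_accumulation(n_per_round=500, generations=20, seed=0):
    """Iteratively refit on the real data plus all synthetic data generated so far."""
    rng = random.Random(seed)
    # Generation 0: real samples from the assumed true model N(0, 1).
    data = [rng.gauss(0.0, 1.0) for _ in range(n_per_round)]
    history = [fit_gaussian_mle(data)]
    for _ in range(generations):
        mu, sigma = history[-1]
        # Sample synthetic data from the latest fitted model and keep everything.
        data.extend(rng.gauss(mu, sigma) for _ in range(n_per_round))
        history.append(fit_gaussian_mle(data))
    return history, len(data)


if __name__ == "__main__":
    n = 500
    history, total = simulate_accumulation(n_per_round=n)
    for t, (mu, sigma) in enumerate(history):
        if t % 5 == 0:
            print(f"gen {t:2d}: mean={mu:+.4f}, std={sigma:.4f}")
    print(f"real-data fraction at the end: {n / total:.3f}")
```

Tracking how far the fitted parameters drift from the true values (0 and 1 in this sketch) across generations gives an empirical view of the quantity that the non-asymptotic bounds control. A single well-behaved family like this one cannot exhibit the fast-collapse examples mentioned above, which require dropping some of the standard assumptions.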