The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent investigations into model-data feedback loops discovered that such loops can lead to model collapse, a phenomenon where performance progressively degrades with each model-fitting iteration until the latest model becomes useless. However, several recent papers studying model collapse assumed that new data replace old data over time rather than assuming data accumulate over time. In this paper, we compare these two settings and show that accumulating data prevents model collapse. We begin by studying an analytically tractable setup in which a sequence of linear models are fit to the previous models' predictions. Previous work showed if data are replaced, the test error increases linearly with the number of model-fitting iterations; we extend this result by proving that if data instead accumulate, the test error has a finite upper bound independent of the number of iterations. We next empirically test whether accumulating data similarly prevents model collapse by pretraining sequences of language models on text corpora. We confirm that replacing data does indeed cause model collapse, then demonstrate that accumulating data prevents model collapse; these results hold across a range of model sizes, architectures and hyperparameters. We further show that similar results hold for other deep generative models on real data: diffusion models for molecule generation and variational autoencoders for image generation. Our work provides consistent theoretical and empirical evidence that data accumulation mitigates model collapse.
翻译:生成式模型的广泛使用,以及基于网络规模数据的预训练,引发了一个现实问题:当这些模型在其自身生成输出上训练时会发生什么?近期对模型-数据反馈循环的研究发现,此类循环可能导致模型崩塌现象——每次模型拟合迭代后性能逐渐退化,直至最新模型完全失效。然而,部分近期研究在探讨模型崩塌时假设新数据会随时间替换旧数据,而非数据随时间不断累积。本文对比了这两种设定,并证明数据累积可防止模型崩塌。我们首先从一个可解析处理的设置入手:在此设置中,一系列线性模型依次拟合前序模型的预测结果。先前研究表明,若采用数据替换方式,测试误差会随模型拟合迭代次数线性增长;我们通过证明在数据累积方案下,测试误差存在一个与迭代次数无关的有限上界,从而扩展了这一结论。随后,我们通过在文本语料上预训练一系列语言模型的实验,实证检验了数据累积是否也能防止模型崩塌。我们证实数据替换确实会导致模型崩塌,并进一步证明数据累积可防止该现象——此结论在多种模型规模、架构和超参数下均成立。我们还证明,在真实数据上的其他深度生成模型也呈现类似结果:用于分子生成的扩散模型与用于图像生成的变分自编码器。本研究提供了一致的理论与实证证据,表明数据累积可以缓解模型崩塌。