Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, Sanmi Koyejo

The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent investigations into model-data feedback loops discovered that such loops can lead to model collapse, a phenomenon where performance progressively degrades with each model-fitting iteration until the latest model becomes useless. However, several recent papers studying model collapse assumed that new data replace old data over time rather than assuming that data accumulate over time. In this paper, we compare these two settings and show that accumulating data prevents model collapse. We begin by studying an analytically tractable setup in which a sequence of linear models is fit to the previous models' predictions. Previous work showed that if data are replaced, the test error increases linearly with the number of model-fitting iterations; we extend this result by proving that if data instead accumulate, the test error has a finite upper bound independent of the number of iterations. We next empirically test whether accumulating data similarly prevents model collapse by pretraining sequences of language models on text corpora. We confirm that replacing data does indeed cause model collapse, then demonstrate that accumulating data prevents model collapse; these results hold across a range of model sizes, architectures, and hyperparameters. We further show that similar results hold for other deep generative models on real data: diffusion models for molecule generation and variational autoencoders for image generation. Our work provides consistent theoretical and empirical evidence that data accumulation mitigates model collapse.
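The contrast between replacing and accumulating data in the linear setting can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code. It assumes Gaussian features with identity covariance, ordinary least squares fits, fresh features at every iteration, and labels produced by the previous fit plus Gaussian noise. The names `simulate`, `mean_final_error`, `T`, `d`, and `sigma` are hypothetical and chosen only for illustration.

```python
import numpy as np


def simulate(n_iters=20, T=200, d=10, sigma=1.0, accumulate=True, seed=0):
    """Fit a chain of linear models, each trained on data labeled by its predecessor.

    Returns the squared parameter error ||w_hat - w_true||^2 after each fit.
    Under the assumed identity feature covariance this equals the excess test error.
    """
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=d)

    # Iteration 1 uses real data: labels come from the true parameters.
    X = rng.normal(size=(T, d))
    y = X @ w_true + sigma * rng.normal(size=T)

    errors = []
    for _ in range(n_iters):
        # Ordinary least squares on the current training set.
        w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        errors.append(float(np.sum((w_hat - w_true) ** 2)))

        # Synthetic data: fresh features, labels from the model just fitted.
        X_new = rng.normal(size=(T, d))
        y_new = X_new @ w_hat + sigma * rng.normal(size=T)

        if accumulate:
            # Keep all earlier data (real and synthetic) and append the new batch.
            X = np.vstack([X, X_new])
            y = np.concatenate([y, y_new])
        else:
            # Discard earlier data and train only on the newest synthetic batch.
            X, y = X_new, y_new
    return errors


def mean_final_error(accumulate, n_seeds=10, **kwargs):
    """Average the final-iteration error over several random seeds."""
    finals = [simulate(accumulate=accumulate, seed=s, **kwargs)[-1]
              for s in range(n_seeds)]
    return float(np.mean(finals))


print("replace   :", mean_final_error(accumulate=False))
print("accumulate:", mean_final_error(accumulate=True))
```

Under these assumptions, the error in the replace setting should keep growing as `n_iters` increases, while the error in the accumulate setting should level off, matching the qualitative claim in the abstract. Exact values depend on the seeds and on the chosen `T`, `d`, and `sigma`.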

