The Curse of Recursion: Training on Generated Data Makes Models Forget

Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language models (LLMs) are here to stay, and will bring about drastic change in the whole ecosystem of online text and images. In this paper we consider what the future might hold. What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training on large-scale data scraped from the web. Indeed, data collected about genuine human interactions with systems will become increasingly valuable as LLM-generated content makes up a growing share of the data crawled from the Internet.
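The disappearance of distribution tails can be illustrated with a toy simulation. The sketch below is not the paper's experimental setup: it assumes a single one-dimensional Gaussian, a small fixed sample size, and a hypothetical helper named `simulate_recursive_fit`. Each generation is fitted only to samples drawn from the previous generation's fit, and the fitted standard deviation is recorded.

```python
import random
import statistics


def simulate_recursive_fit(n_samples=10, n_generations=500, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation per generation, starting with
    the true value (1.0) at index 0. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: stands in for the "real" data
    sigmas = [sigma]
    for _ in range(n_generations):
        # Draw a finite training set from the current model only.
        data = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        # Refit by maximum likelihood (population standard deviation).
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data, mu)
        sigmas.append(sigma)
    return sigmas


if __name__ == "__main__":
    history = simulate_recursive_fit()
    for generation in (0, 10, 100, 500):
        print(generation, history[generation])
```

With a small sample size, each refit loses some variance on average and the estimation errors compound across generations, so the fitted standard deviation tends to drift toward zero. Rare events far from the mean then become progressively less likely to be sampled, which is the toy analogue of the tails disappearing.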

