Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language models (LLMs) are here to stay, and will bring about drastic change in the whole ecosystem of online text and images. In this paper we consider what the future might hold. What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that the use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, data collected about genuine human interactions with systems will become increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.
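The disappearance of distribution tails can be illustrated with a toy setting that is not taken from the paper's experiments: a single one-dimensional Gaussian that is repeatedly re-fitted to a finite sample drawn from its own previous fit. The sketch below assumes maximum-likelihood estimates of the mean and standard deviation, a small sample size, and a hypothetical function name `recursive_gaussian_fit`; all of these are illustrative choices. Under these assumptions the fitted standard deviation tends to shrink over generations, so later models place less and less mass on the tails of the original distribution.

```python
import random
import statistics


def recursive_gaussian_fit(generations=200, sample_size=10, seed=0):
    """Toy illustration: each generation is trained only on samples
    produced by the previous generation's fitted Gaussian.

    Returns the fitted standard deviation at every generation,
    starting with the true value 1.0 at index 0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0          # generation 0: the "real" data distribution
    history = [sigma]
    for _ in range(generations):
        # Draw a finite training set from the current model only.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # Re-fit by maximum likelihood (population standard deviation).
        mu = statistics.mean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    history = recursive_gaussian_fit()
    for generation in (0, 10, 50, 100, 200):
        print(f"generation {generation:3d}: fitted std = {history[generation]:.3e}")
```

The small sample size is chosen deliberately to make the effect visible within a few hundred generations; larger samples slow the shrinkage in this toy setting but do not change its direction.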