Language models (LMs) are widely used to generate text on the Internet, and this generated text is often swept into the training corpora of subsequent generations of LMs. Prior work has shown experimentally that LMs collapse when trained on recursively generated text. This paper contributes to that line of work in two ways. First, we present a theoretical proof of LM collapse; the proof reveals the cause of collapse and shows that every auto-regressive LM will inevitably collapse. Second, we report a new empirical finding: the performance of LMs trained on recursively generated text declines steadily until they perform no better than a randomly initialized LM. The trained LMs emit large amounts of repetitive text and perform poorly across a wide range of natural language tasks. Together, the proof and the new finding deepen our understanding of LM collapse and offer insights that may inspire new training techniques to mitigate this threat.
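The collapse dynamic described above can be illustrated with a minimal, hypothetical simulation (this is my own toy sketch, not the paper's models or proof): a "model" is just a categorical distribution over tokens, and each generation is refit by maximum likelihood on a finite corpus sampled from the previous generation. Any token that happens to receive zero samples is lost forever, so the support can only shrink, and the process eventually degenerates to repeating a single token — a caricature of the repetitive output described above.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical toy setup (not the paper's actual models): a "model" is a
# categorical distribution over a tiny vocabulary, and each new generation
# is trained by maximum likelihood on a finite corpus sampled from the
# previous generation's model.
VOCAB = list(range(10))
CORPUS_SIZE = 20          # a small corpus exaggerates sampling noise
GENERATIONS = 2000

def sample_corpus(probs):
    """The current model 'generates text' by sampling tokens."""
    return random.choices(VOCAB, weights=probs, k=CORPUS_SIZE)

def fit(corpus):
    """The next model is the empirical distribution of the corpus."""
    counts = Counter(corpus)
    return [counts[t] / len(corpus) for t in VOCAB]

def support_size(probs):
    """Number of tokens the model can still produce."""
    return sum(p > 0 for p in probs)

probs = [1 / len(VOCAB)] * len(VOCAB)  # generation 0: uniform
sizes = [support_size(probs)]
for _ in range(GENERATIONS):
    probs = fit(sample_corpus(probs))
    sizes.append(support_size(probs))

# A token that receives zero samples gets probability 0 and can never
# reappear, so the support is monotonically non-increasing.
assert all(a >= b for a, b in zip(sizes, sizes[1:]))
print(f"support: {sizes[0]} tokens -> {sizes[-1]} tokens")
```

This is a Wright–Fisher-style absorbing process: the degenerate single-token distributions are the only absorbing states, so with enough generations the toy model ends up emitting one token repeatedly, regardless of where it started.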