As scaling laws push the training of frontier large language models (LLMs) toward ever-growing data requirements, training pipelines are approaching a regime where much of the publicly available online text may be consumed. At the same time, widespread LLM usage increases the volume of machine-generated content on the web. Together, these trends raise the likelihood that generated text re-enters future training corpora, and with it the risk of the performance degradation often called model collapse. In practice, model developers address this concern through data cleaning, watermarking, synthetic-data policies, or, in some cases, blissful ignorance. However, the problem of model collapse in generative models has not been examined from a learning-theoretic perspective. We study it through the lens of the language-generation-in-the-limit framework, introducing a replay adversary that augments the example stream with the generator's own past outputs. Our main contribution is a fine-grained learning-theoretic characterization of when replay fundamentally limits generation: while replay is benign for the strongest notion, uniform generation, it provably creates separations for the weaker notions of non-uniform generation and generation in the limit. Interestingly, our positive results mirror heuristics widely used in practice, such as data cleaning, watermarking, and output filtering, while our separations show when these ideas can fail.
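To make the replay setting concrete, the following is a minimal toy sketch, not the paper's construction. Everything in it is an assumption made for illustration: the hypothesis class (the languages L_k of positive multiples of k, for k up to a hypothetical `MAX_K`), the `guess_k` rule, the adversary schedule (one true example per round, followed by a replay of the generator's previous output), and the `filter_own_outputs` flag, which stands in for output filtering or watermark-based cleaning.

```python
from functools import reduce
from math import gcd

MAX_K = 10  # hypothetical toy class: L_k = positive multiples of k, k = 1..MAX_K


def guess_k(examples):
    """Largest k <= MAX_K that divides every example (most specific consistent guess)."""
    g = reduce(gcd, examples)
    return max(k for k in range(1, MAX_K + 1) if g % k == 0)


def run(true_examples, filter_own_outputs):
    """Simulate a toy generator against a toy replay adversary.

    Each round the adversary shows one true example and then replays the
    generator's previous output (if any) as though it were a real example.
    Returns the generator's outputs, one per round.
    """
    seen, outputs = [], []
    for x in true_examples:
        stream = [x] + outputs[-1:]  # true example, then the replayed output
        for item in stream:
            if filter_own_outputs and item in outputs:
                continue  # discard anything the generator itself produced
            seen.append(item)
        k = guess_k(seen)
        # Emit the smallest multiple of k not yet seen and not yet output.
        used = set(seen) | set(outputs)
        y = k
        while y in used:
            y += k
        outputs.append(y)
    return outputs


if __name__ == "__main__":
    examples = [12, 4, 8, 16, 20, 24]  # true language: multiples of 4
    print("naive:   ", run(examples, filter_own_outputs=False))
    print("filtered:", run(examples, filter_own_outputs=True))
```

On this particular stream, the naive generator's early mistake is replayed as evidence and it never outputs a multiple of 4 in the six rounds shown, whereas the filtering generator outputs only multiples of 4 from the second round on. Note that the filter cannot tell a replayed output from a genuine example with the same value, so it also discards some true examples; this toy says nothing about the uniform, non-uniform, or in-the-limit separations themselves.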