Can Generative Artificial Intelligence Survive Data Contamination? Theoretical Guarantees under Contaminated Recursive Training

Generative Artificial Intelligence (AI), such as large language models (LLMs), has become a transformative force across science, industry, and society. As these systems grow in popularity, web data becomes increasingly interwoven with AI-generated material, which is increasingly difficult to separate from naturally generated content. Because generative models are updated regularly, later models will inevitably be trained on mixtures of human-generated data and AI-generated data from earlier versions, creating a recursive training process with data contamination. Existing theoretical work has examined only highly simplified settings, in which both the real data and the generative model are discrete or Gaussian, and has shown that such recursive training leads to model collapse. However, real data distributions are far more complex, and modern generative models are far more flexible than Gaussian and linear mechanisms. To fill this gap, we study recursive training in a general framework that places minimal assumptions on the real data distribution and allows the underlying generative model to be a general universal approximator. In this framework, we show that contaminated recursive training still converges, with a convergence rate equal to the minimum of the baseline model's convergence rate and the fraction of real data used in each iteration. To the best of our knowledge, this is the first (positive) theoretical result on recursive training without distributional assumptions on the data. We further extend the analysis to settings where sampling bias is present in data collection, and we support all theoretical results with empirical studies.
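
To make the recursive training loop concrete, the following is a minimal toy sketch, not the framework analyzed in the paper. It assumes a one-dimensional Gaussian "model" fitted by sample mean and standard deviation, whereas the paper allows a general universal approximator. The function name `recursive_training` and the parameters `real_sampler`, `n_generations`, `n_samples`, and `real_fraction` are hypothetical names chosen for illustration.

```python
import random
import statistics


def recursive_training(real_sampler, n_generations, n_samples, real_fraction, seed=0):
    """Toy contaminated recursive training with a Gaussian stand-in model.

    Each generation fits (mean, std) to a mixture of fresh real samples and
    synthetic samples drawn from the previous generation's fit.
    All names here are illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    n_real = round(real_fraction * n_samples)  # real samples per generation
    n_synth = n_samples - n_real               # AI-generated samples per generation

    # Generation 0 is trained on real data only.
    data = [real_sampler(rng) for _ in range(n_samples)]
    mean, std = statistics.fmean(data), statistics.stdev(data)
    history = [(mean, std)]

    for _ in range(n_generations):
        # Fresh real data mixed with samples from the previous model.
        real = [real_sampler(rng) for _ in range(n_real)]
        synth = [rng.gauss(mean, std) for _ in range(n_synth)]
        data = real + synth
        mean, std = statistics.fmean(data), statistics.stdev(data)
        history.append((mean, std))
    return history


if __name__ == "__main__":
    # Real data assumed to be standard normal, for this toy example only.
    fits = recursive_training(lambda rng: rng.gauss(0.0, 1.0),
                              n_generations=50, n_samples=200, real_fraction=0.5)
    print(fits[-1])
```

Varying `real_fraction` in this sketch gives an informal feel for how the share of real data in each iteration affects the fitted parameters across generations; it does not reproduce the paper's rates or experiments.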
