This paper tackles the emerging challenge of training generative models within a self-consuming loop, in which successive generations of models are recursively trained on mixtures of real data and synthetic data from previous generations. We construct a theoretical framework to rigorously evaluate how this training procedure affects the data distributions learned by future models, covering both parametric and non-parametric models. Specifically, for diffusion models with a one-hidden-layer neural network score function, we derive bounds on the total variation (TV) distance between the synthetic data distributions produced by future models and the original real data distribution under various mixed training scenarios. Our analysis demonstrates that this distance can be effectively controlled provided that the mixed training dataset sizes or the proportions of real data are large enough. Interestingly, we further unveil a phase transition induced by increasing the amount of synthetic data, proving theoretically that the TV distance initially rises but declines beyond a threshold point. Finally, we present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
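To make the self-consuming loop concrete, the following is a minimal sketch of the kernel density estimation setting in one dimension. It is an illustration under stated assumptions, not the construction analyzed in the paper: the real distribution is taken to be a standard normal, the bandwidth uses a normal-reference rule of thumb, each generation's training set reuses a fixed pool of real samples, and all names (`self_consuming_loop`, `real_fraction`, `n_train`) are hypothetical. The TV distance is approximated by a midpoint Riemann sum of half the absolute density difference on a truncated grid.

```python
import math
import random


def gaussian_pdf(x, mean=0.0, std=1.0):
    """Density of N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def rule_of_thumb_bandwidth(data):
    """Normal-reference bandwidth 1.06 * std * n^(-1/5) (an assumption, not from the paper)."""
    n = len(data)
    mean = sum(data) / n
    std = math.sqrt(sum((d - mean) ** 2 for d in data) / (n - 1))
    return 1.06 * std * n ** (-0.2)


def kde_pdf(x, data, bandwidth):
    """Gaussian kernel density estimate at x."""
    return sum(gaussian_pdf(x, d, bandwidth) for d in data) / len(data)


def kde_sample(data, bandwidth, size, rng):
    """Draw from the Gaussian KDE: pick a data point, then add kernel noise."""
    return [rng.choice(data) + bandwidth * rng.gauss(0.0, 1.0) for _ in range(size)]


def tv_to_standard_normal(data, bandwidth, lo=-6.0, hi=6.0, steps=240):
    """Approximate TV(KDE, N(0,1)) = 0.5 * integral |p - q| with a midpoint sum."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += abs(kde_pdf(x, data, bandwidth) - gaussian_pdf(x))
    return 0.5 * total * dx


def self_consuming_loop(n_train=200, real_fraction=0.5, generations=5, seed=0):
    """Return the approximate TV distance of each generation's KDE to N(0,1).

    Each new training set mixes round(real_fraction * n_train) points drawn
    from a fixed pool of real samples with synthetic points sampled from the
    previous generation's KDE (a simplified, hypothetical mixing scheme).
    """
    rng = random.Random(seed)
    real = [rng.gauss(0.0, 1.0) for _ in range(n_train)]
    train = list(real)  # generation 0 is trained on real data only
    n_real_part = round(real_fraction * n_train)
    history = []
    for _ in range(generations):
        h = rule_of_thumb_bandwidth(train)
        history.append(tv_to_standard_normal(train, h))
        synthetic = kde_sample(train, h, n_train - n_real_part, rng)
        train = rng.sample(real, n_real_part) + synthetic
    return history


if __name__ == "__main__":
    for frac in (0.0, 0.5, 1.0):
        tvs = self_consuming_loop(real_fraction=frac)
        print(frac, [round(tv, 3) for tv in tvs])
```

Running the script with different values of `real_fraction` gives a rough feel for how the share of real data in each generation influences the drift of the learned density; the numbers depend on the seed and on the simplifying assumptions above, so they should not be read as reproducing the paper's bounds.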