Model collapse presents a new challenge for the iterative training of generative models: training on synthetic data leads to an overall degradation of performance. This paper examines the problem from a statistical viewpoint and shows that one can in fact hope for improvement when models are trained on data contaminated with synthetic samples, as long as some fresh information from the true target distribution is present. In particular, we consider iterative training on samples drawn from a mixture of the true target distribution and the synthetic distribution. We analyze the full iterative evolution of a next-token prediction language model, capturing how the interplay between the mixture weights and the sample sizes controls long-term performance. With a non-trivial mixture weight on the true distribution, even one that decays over time, simply training the model in a contamination-agnostic manner with appropriate sample sizes can avoid collapse and, under certain conditions, even recover the true target distribution. Simulation studies support our findings and show that this behavior extends to other classes of models.
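To make the iterative mixture-training loop concrete, below is a minimal simulation sketch. It is a toy under stated assumptions, not the paper's actual setup: a one-dimensional Gaussian fit by maximum likelihood stands in for the next-token language model, and `alpha`, `n_samples`, and `n_generations` are illustrative parameters introduced here, not quantities from the paper.

```python
# Minimal sketch of iterative retraining on a mixture of real and
# synthetic data. Illustrative assumptions: a 1-D Gaussian fit by MLE
# stands in for the next-token language model, and alpha, n_samples,
# and n_generations are hypothetical parameters for this toy setup.
import numpy as np

rng = np.random.default_rng(0)
TRUE_MU, TRUE_SIGMA = 0.0, 1.0  # true target distribution N(0, 1)

def run(alpha, n_samples=1000, n_generations=50):
    """At each generation, draw alpha*n fresh samples from the true
    distribution and (1 - alpha)*n synthetic samples from the current
    model, then refit the Gaussian in a contamination-agnostic way."""
    mu, sigma = TRUE_MU, TRUE_SIGMA  # generation-0 model matches the truth
    for _ in range(n_generations):
        n_real = int(alpha * n_samples)
        real = rng.normal(TRUE_MU, TRUE_SIGMA, n_real)
        synth = rng.normal(mu, sigma, n_samples - n_real)
        data = np.concatenate([real, synth])
        mu, sigma = data.mean(), data.std()  # plain MLE refit, no correction
    return mu, sigma

for alpha in [0.0, 0.1, 0.5]:
    mu, sigma = run(alpha)
    print(f"alpha={alpha:.1f}: final mu={mu:+.3f}, sigma={sigma:.3f}")
```

Running this toy reproduces the qualitative pattern the abstract describes: with `alpha = 0` (purely synthetic retraining) the fitted variance shrinks across generations, a collapse, while any fixed `alpha > 0` injects fresh information from the true distribution and keeps the estimate anchored near it.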