Model collapse poses a new challenge for the iterative training of generative models: training repeatedly on synthetic data degrades overall performance. This paper examines the problem from a statistical viewpoint, showing that improvement is in fact possible when models are trained on data contaminated with synthetic samples, provided some fresh information from the true target distribution remains. In particular, we consider iterative training on samples drawn from a mixture of the true target and synthetic distributions. We analyze the entire iterative evolution in a next-token prediction language model, characterizing how the interplay between the mixture weights and the sample size controls long-term performance. With a non-trivial mixture weight on the true distribution, even one that decays over time, simply training the model in a contamination-agnostic manner with appropriate sample sizes can avoid collapse and, under certain conditions, even recover the true target distribution. Simulation studies support our findings and show that this behavior extends to other classes of models.
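The iterative setup above can be illustrated with a minimal sketch. This is our own toy construction, not the paper's exact model: a categorical "next-token" distribution is re-estimated at each generation by contamination-agnostic empirical frequencies, where a fraction `alpha` of each generation's samples comes fresh from the true distribution and the rest is synthetic data from the previous model. The values of `alpha`, the sample size `n`, and the vocabulary size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 10                               # vocabulary size (illustrative)
p_true = rng.dirichlet(np.ones(V))   # fixed true target distribution

def iterate(alpha, n, T):
    """Run T generations; each generation trains on n samples, of which a
    Binomial(n, alpha) share is fresh true data and the rest is synthetic
    data generated by the previous model."""
    p_model = p_true.copy()
    for _ in range(T):
        n_true = rng.binomial(n, alpha)
        samples = np.concatenate([
            rng.choice(V, size=n_true, p=p_true),       # fresh true samples
            rng.choice(V, size=n - n_true, p=p_model),  # synthetic samples
        ])
        counts = np.bincount(samples, minlength=V)
        p_model = counts / counts.sum()  # contamination-agnostic empirical MLE
    return np.abs(p_model - p_true).sum()  # L1 (total-variation-style) error

# Pure self-consumption (alpha = 0) drifts away from the target over many
# generations, while a constant fresh-data fraction keeps the error anchored
# near the sampling-noise level.
err_pure = iterate(alpha=0.0, n=500, T=200)
err_mix = iterate(alpha=0.5, n=500, T=200)
print(f"alpha=0.0: {err_pure:.3f}, alpha=0.5: {err_mix:.3f}")
```

With `alpha = 0` the update is a multinomial resampling random walk, so estimation noise accumulates across generations; with `alpha > 0` each generation is pulled back toward `p_true`, matching the abstract's claim that a non-trivial mixture weight of true data can prevent collapse.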