Machine learning models are increasingly trained or fine-tuned on synthetic data. Recursive training on such data has been observed to degrade performance significantly across a wide range of tasks, with the degradation typically manifesting as a progressive drift away from the target distribution. In this work, we theoretically analyze this phenomenon in the setting of score-based diffusion models. For a realistic pipeline in which each training round uses a mixture of synthetic data and fresh samples from the target distribution, we derive upper and lower bounds on the accumulated divergence between the generated and target distributions. These bounds allow us to characterize distinct drift regimes as a function of the score estimation error and the proportion of fresh data used in each generation. We complement the theory with empirical results on synthetic data and images.
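To make the recursive pipeline concrete, the following is a minimal toy sketch, not the paper's actual model: a one-dimensional Gaussian stands in for the diffusion model, refitting on a fresh/synthetic mixture stands in for retraining, and a variance inflation factor stands in for the per-round score estimation error. All names (`simulate`, `fresh_frac`, `score_err`) are illustrative assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_rounds=20, n=10_000, fresh_frac=0.5, score_err=0.05):
    """Toy recursion: each round, fit a Gaussian 'model' on a mix of
    fresh target samples and synthetic samples from the previous model.
    `score_err` proxies the score-estimation error by inflating the
    variance of the synthetic samples (an assumption for illustration,
    not the paper's error model)."""
    target_mu, target_sigma = 0.0, 1.0
    mu, sigma = target_mu, target_sigma            # generation-0 model
    drifts = []
    for _ in range(n_rounds):
        n_fresh = int(fresh_frac * n)
        fresh = rng.normal(target_mu, target_sigma, n_fresh)
        # imperfect generator: inflated variance mimics estimation error
        synth = rng.normal(mu, sigma * (1 + score_err), n - n_fresh)
        data = np.concatenate([fresh, synth])
        mu, sigma = data.mean(), data.std()        # "retrain" = refit
        # squared 2-Wasserstein distance between 1-D Gaussians
        drifts.append((mu - target_mu) ** 2 + (sigma - target_sigma) ** 2)
    return drifts

for frac in (0.0, 0.25, 1.0):
    print(f"fresh_frac={frac}: final drift = {simulate(fresh_frac=frac)[-1]:.4f}")
```

Running the sketch shows the qualitative regimes the abstract refers to: with `fresh_frac=0.0` (fully synthetic retraining) the drift compounds across generations, while a positive fraction of fresh samples damps the accumulation and keeps the fitted model near the target distribution.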