Recent studies have identified an intriguing phenomenon in recursive generative model training known as model collapse, in which models trained on data generated by previous models exhibit severe performance degradation. Addressing this issue and developing more effective training strategies have become central challenges in generative model research. In this paper, we investigate this phenomenon within a novel framework in which generative models are iteratively trained on a combination of newly collected real data and synthetic data from the previous training step. To develop an optimal training strategy for integrating real and synthetic data, we evaluate the performance of a weighted training scheme in several settings: Gaussian distribution estimation, generalized linear models, and nonparametric estimation. We theoretically characterize how the mixing proportion and the weighting scheme of synthetic data affect the final model's performance. Our key finding is that, across these settings, the optimal weighting scheme for a given proportion of synthetic data asymptotically follows a unified expression, revealing a fundamental trade-off between leveraging synthetic data and model performance. In some cases, the optimal weight assigned to real data is the reciprocal of the golden ratio. Finally, we validate our theoretical results on extensive simulated datasets and a real tabular dataset.
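To make the weighted training scheme concrete, the following is a minimal sketch of a toy Gaussian mean-estimation recursion. It is an illustration under stated assumptions, not necessarily the exact setup analyzed in the paper: the variance is taken to be known and equal to 1, each step uses `n` fresh real samples and `m` synthetic samples drawn around the previous estimate, and the new estimate is `w * (real sample mean) + (1 - w) * (synthetic sample mean)`. All function names are hypothetical. Under these assumptions the estimator's variance obeys V_t = w^2/n + (1-w)^2 (V_{t-1} + 1/m), and the sketch searches for the weight `w` that minimizes its limiting value.

```python
import math


def variance_recursion(w, n, m, steps=200):
    """Iterate V_t = w^2/n + (1-w)^2 * (V_{t-1} + 1/m) for a fixed weight w.

    Toy assumption: unit-variance Gaussian data, n real and m synthetic
    samples per step, and V_0 = 1/n (the first model sees real data only).
    """
    v = 1.0 / n
    for _ in range(steps):
        v = w**2 / n + (1 - w) ** 2 * (v + 1.0 / m)
    return v


def stationary_variance(w, n, m):
    """Closed-form fixed point of the recursion above, valid for 0 < w <= 1."""
    return (w**2 / n + (1 - w) ** 2 / m) / (1 - (1 - w) ** 2)


def best_weight(n, m, grid=100_000):
    """Grid search over w in (0, 1] for the smallest stationary variance."""
    candidates = ((i + 1) / grid for i in range(grid))
    return min(candidates, key=lambda w: stationary_variance(w, n, m))


if __name__ == "__main__":
    n = m = 100  # equal amounts of real and synthetic data (toy assumption)
    w_star = best_weight(n, m)
    print("best weight on real data:", round(w_star, 4))
    print("reciprocal of golden ratio:", round((math.sqrt(5) - 1) / 2, 4))
    print("stationary variance at w*:", stationary_variance(w_star, n, m))
    print("real-data-only variance:", 1.0 / n)
```

In this toy model with `n = m`, setting the derivative of the stationary variance to zero gives w^2 + w - 1 = 0, whose root in (0, 1) is (sqrt(5) - 1)/2, approximately 0.618, the reciprocal of the golden ratio. Other ratios of `n` to `m` give different minimizers, which is one simple way to see why the optimal weight depends on the proportion of synthetic data.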