Model collapse occurs when generative models degrade after repeatedly training on their own synthetic outputs. We study this effect in overparameterized linear regression in a setting where each iteration mixes fresh real labels with synthetic labels drawn from the model fitted in the previous iteration. We derive precise generalization error formulae for minimum-$\ell_2$-norm interpolation and ridge regression under this iterative scheme. Our analysis reveals intriguing properties of the optimal mixing weight that minimizes long-term prediction error and provably prevents model collapse. For instance, in the case of minimum-$\ell_2$-norm interpolation, we establish that the optimal real-data proportion converges to the reciprocal of the golden ratio for fairly general classes of covariate distributions. Previously, this property was known only for ordinary least squares, and only in low dimensions. For ridge regression, we further analyze two popular model classes -- the random-effects model and the spiked covariance model -- demonstrating how spectral geometry governs optimal weighting. In both cases, as well as for isotropic features, we uncover that the optimal mixing ratio should be at least one-half, reflecting the necessity of favoring real data over synthetic data. We study three additional settings: (i) where the real data is fixed and no fresh labels are obtained at each iteration, (ii) where covariates vary across iterations but fresh real labels are available each time, and (iii) where covariates vary with time but only a fraction of them receive fresh real labels at each iteration. Across these diverse settings, we characterize when model collapse is inevitable and when synthetic data improves learning. We validate our theoretical results with extensive simulations.
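The iterative mixing scheme described in the abstract can be illustrated numerically. The sketch below is a minimal, hypothetical simulation under assumptions not fixed by the abstract: isotropic Gaussian covariates, label-level convex mixing with weight `alpha` (the real-data proportion), noisy synthetic labels drawn from the previous iterate, and minimum-$\ell_2$-norm refitting at every iteration. The exact protocol analyzed in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparameterized regime: dimension d exceeds sample size n.
n, d, sigma, T = 50, 100, 0.5, 30
X = rng.normal(size=(n, d)) / np.sqrt(d)   # isotropic covariates (assumption)
beta_star = rng.normal(size=d)             # ground-truth coefficients

def iterate(alpha, T=T):
    """Run T rounds of retraining with real-data mixing weight alpha.

    Each round mixes fresh real labels with synthetic labels generated
    (with noise) by the previous iterate, then refits the minimum-l2-norm
    interpolator. Returns the final excess risk ||beta_hat - beta_star||^2,
    which equals the prediction error for isotropic test covariates.
    """
    Xp = np.linalg.pinv(X)                 # min-norm solution operator
    # Initial fit on real data only.
    beta_hat = Xp @ (X @ beta_star + sigma * rng.normal(size=n))
    for _ in range(T):
        y_real = X @ beta_star + sigma * rng.normal(size=n)  # fresh real labels
        y_syn = X @ beta_hat + sigma * rng.normal(size=n)    # labels from previous model
        y_mix = alpha * y_real + (1 - alpha) * y_syn         # convex mixing (assumption)
        beta_hat = Xp @ y_mix                                # min-l2-norm interpolation
    return float(np.sum((beta_hat - beta_star) ** 2))

# Pure synthetic training (alpha = 0) lets noise compound across rounds,
# the collapse mechanism; fresh real labels (alpha = 1) keep the risk bounded.
err_pure_synthetic = iterate(alpha=0.0)
err_pure_real = iterate(alpha=1.0)
err_golden = iterate(alpha=2 / (1 + np.sqrt(5)))  # reciprocal of the golden ratio
```

In this toy run, `err_pure_synthetic` grows roughly linearly in the number of iterations as the synthetic-label noise accumulates, while any fixed positive real-data proportion keeps the long-run risk bounded; the abstract's result concerns which such proportion is optimal.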