Iterative self-training (self-distillation) repeatedly refits a model on its own predictions used as pseudo-labels. We study this procedure in overparameterized linear regression: an initial estimator is trained on noisy labels, and each subsequent iterate is trained on fresh covariates with noiseless pseudo-labels from the previous model. In the high-dimensional regime, we derive deterministic-equivalent recursions for the prediction risk and effective noise across iterations, and prove that the empirical quantities concentrate sharply around these limits. The recursion separates two competing forces: a systematic component that grows with iteration due to progressive signal forgetting, and a stochastic component that decays due to denoising via repeated data-dependent projections. Their interaction yields a $U$-shaped test-risk curve and an optimal early-stopping time. In spiked covariance models, self-training further acts as an iteration-dependent spectral filter that preserves strong eigendirections while suppressing weaker ones, inducing an implicit form of soft feature selection distinct from ridge regression. Finally, we propose an iterated generalized cross-validation criterion and prove its uniform consistency for estimating the risk along the self-training trajectory, enabling fully data-driven selection of the stopping time and regularization. Experiments with synthetic covariance models validate the theory and illustrate the predicted denoising-forgetting trade-off.
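The following is a minimal simulation sketch of the self-training loop described above, assuming ridge regression as the base fitter and a diagonal spiked covariance; the dimensions, regularization level, noise scale, and number of iterations are illustrative choices, not settings taken from the paper.

```python
# Illustrative sketch: iterative self-training in overparameterized linear regression.
# Iteration 0 fits noisy labels; each later iteration fits fresh covariates with
# noiseless pseudo-labels from the previous model. All constants are assumptions.
import numpy as np

rng = np.random.default_rng(0)

n, d, lam, T, sigma_noise = 200, 600, 1e-2, 15, 1.0   # overparameterized: d > n

# Spiked covariance: a few strong eigendirections, the rest weak.
eigvals = np.concatenate([np.full(10, 25.0), np.full(d - 10, 0.5)])
beta_star = rng.standard_normal(d) / np.sqrt(d)        # ground-truth signal

def sample_X(n):
    """Fresh covariates with covariance diag(eigvals)."""
    return rng.standard_normal((n, d)) * np.sqrt(eigvals)

def ridge_fit(X, y, lam):
    """Ridge estimator in dual form (cheap when d > n)."""
    m = X.shape[0]
    K = X @ X.T + m * lam * np.eye(m)
    return X.T @ np.linalg.solve(K, y)

def test_risk(beta):
    """Prediction risk (beta - beta*)^T Sigma (beta - beta*) for diagonal Sigma."""
    diff = beta - beta_star
    return float(np.sum(eigvals * diff**2))

# Iteration 0: train on noisy labels.
X0 = sample_X(n)
y0 = X0 @ beta_star + sigma_noise * rng.standard_normal(n)
beta = ridge_fit(X0, y0, lam)
risks = [test_risk(beta)]

# Iterations t >= 1: fresh covariates, noiseless pseudo-labels from previous model.
for t in range(1, T + 1):
    Xt = sample_X(n)
    yt = Xt @ beta                  # pseudo-labels carry no fresh label noise
    beta = ridge_fit(Xt, yt, lam)
    risks.append(test_risk(beta))

# The risk trajectory typically dips (denoising) and then rises (signal forgetting),
# tracing the U-shaped curve with an optimal early-stopping iteration.
print(np.round(risks, 4))
```

In this toy setup, the printed risk sequence makes the denoising-forgetting trade-off visible: early iterations suppress the stochastic component inherited from the noisy labels, while later iterations progressively lose signal through repeated data-dependent projections.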