Self-distillation (SD) is the process of retraining a student on a mixture of the ground-truth labels and the teacher's own predictions, using the same architecture and training data. Although SD has been shown empirically to often improve generalization, its formal guarantees remain limited. We study SD for ridge regression in an unconstrained setting in which the mixing weight $ξ$ may lie outside the unit interval. Conditioning on the training data and without any distributional assumptions, we prove that for any squared prediction risk (including out-of-distribution), the optimally mixed student strictly improves upon the ridge teacher at every regularization level $λ > 0$ at which the teacher ridge risk $R(λ)$ is nonstationary (i.e., $R'(λ) \neq 0$). We obtain a closed-form expression for the optimal mixing weight $ξ^\star(λ)$ at any value of $λ$ and show that it obeys the sign rule $\operatorname{sign}(ξ^\star(λ)) = -\operatorname{sign}(R'(λ))$. In particular, $ξ^\star(λ)$ can be negative, which is the case in over-regularized regimes. To quantify the risk improvement due to SD, we derive exact deterministic equivalents for the optimal SD risk in the proportional asymptotics regime (where the sample size $n$ and feature dimension $p$ both diverge while their aspect ratio $p/n$ converges) under general anisotropic covariance and deterministic signals. Our asymptotic analysis extends the standard second-order ridge deterministic equivalents to their fourth-order analogs via block linearization, which may be of independent interest. From a practical standpoint, we propose a consistent one-shot tuning method that estimates $ξ^\star$ without grid search, sample splitting, or refitting. Experiments on real-world datasets and pretrained neural network features support our theory and the one-shot tuning method.
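As a minimal sketch of the setup (the notation below, including the design matrix $X \in \mathbb{R}^{n \times p}$, the response $y \in \mathbb{R}^n$, and the scaling of the ridge penalty, is assumed for illustration rather than taken from the paper body), the ridge teacher, the mixed targets, and the self-distilled student with mixing weight $ξ$ can be written as:
% Hedged sketch of the SD-for-ridge setup; X, y, I_p, and the penalty scaling are
% illustrative assumptions, not the paper's exact definitions.
\begin{align*}
  \hat\beta_{\mathrm{teacher}}(λ)
    &= \big(X^\top X + nλ I_p\big)^{-1} X^\top y,
    && \text{ridge teacher} \\
  \tilde y(ξ)
    &= (1-ξ)\, y + ξ\, X \hat\beta_{\mathrm{teacher}}(λ),
    && \text{mixed targets, } ξ \text{ possibly outside } [0,1] \\
  \hat\beta_{\mathrm{SD}}(ξ, λ)
    &= \big(X^\top X + nλ I_p\big)^{-1} X^\top \tilde y(ξ),
    && \text{self-distilled student}
\end{align*}
Under this sketch, the optimal $ξ^\star(λ)$ is the value minimizing the student's squared prediction risk, and the sign rule says it is positive when the teacher is under-regularized ($R'(λ) < 0$) and negative when the teacher is over-regularized ($R'(λ) > 0$).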