Knowledge distillation is widely used to improve generalization in practice, yet its theoretical understanding remains elusive. In the standard distillation setting, a teacher model provides soft predictions to guide the training of a student model. We model teacher and student training as coupled stochastic processes and introduce a distillation divergence, defined as the Kullback-Leibler divergence between these two stochastic kernels. Within this framework, we derive two generalization bounds for the student model relative to the teacher's generalization gap: an upper bound under a sub-Gaussian assumption via algorithmic stability, and a lower bound under a central condition with sharper dependence on the distillation divergence. We further develop a loss-sharpness-aware bound with an explicit tightness regime, showing that the teacher's local flatness can strictly tighten the bound. Additionally, in a linear Gaussian case study, the distillation divergence admits an interpretable decomposition into bias, variance, and rank-bottleneck costs, yielding practical guidance for distillation design.
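As a rough illustration of the linear Gaussian case, the sketch below computes the standard closed-form KL divergence between two Gaussians and splits it into a mean-mismatch term and a covariance-mismatch term, which loosely mirror the bias and variance costs mentioned above. It is not the paper's decomposition: the function name `gaussian_kl_terms`, the direction of the divergence (student relative to teacher), and the full-rank covariance requirement are all assumptions made for this example, and the rank-bottleneck cost is not modeled.

```python
import numpy as np


def gaussian_kl_terms(mu_s, cov_s, mu_t, cov_t):
    """Split KL( N(mu_s, cov_s) || N(mu_t, cov_t) ) into two parts.

    Hypothetical helper for illustration only. Both covariance matrices
    are assumed to be full rank (symmetric positive definite).
    Returns (mean_term, cov_term); their sum is the KL divergence.
    """
    mu_s = np.asarray(mu_s, dtype=float)
    mu_t = np.asarray(mu_t, dtype=float)
    cov_s = np.asarray(cov_s, dtype=float)
    cov_t = np.asarray(cov_t, dtype=float)
    d = mu_s.shape[0]

    # Mean-mismatch term: 0.5 * (mu_t - mu_s)^T cov_t^{-1} (mu_t - mu_s)
    diff = mu_t - mu_s
    mean_term = 0.5 * diff @ np.linalg.solve(cov_t, diff)

    # Covariance-mismatch term:
    # 0.5 * (tr(cov_t^{-1} cov_s) - d + log det cov_t - log det cov_s)
    _, logdet_s = np.linalg.slogdet(cov_s)
    _, logdet_t = np.linalg.slogdet(cov_t)
    trace_term = np.trace(np.linalg.solve(cov_t, cov_s))
    cov_term = 0.5 * (trace_term - d + logdet_t - logdet_s)

    return mean_term, cov_term


# Usage: a student whose mean is shifted and whose variance is too small.
mean_part, cov_part = gaussian_kl_terms(
    mu_s=[0.0, 0.0], cov_s=np.eye(2) * 0.5,
    mu_t=[1.0, 0.0], cov_t=np.eye(2),
)
total_kl = mean_part + cov_part
```

Keeping the two terms separate shows which mismatch dominates the divergence: shifting only the means changes `mean_part`, while shrinking or inflating the student's covariance changes only `cov_part`.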