Self-distillation (SD) is the process of training a student model on the outputs of a teacher model when both models share the same architecture. We theoretically examine SD in multi-class classification with cross-entropy loss, covering both multi-round SD and SD with refined teacher outputs, the latter inspired by partial label learning (PLL). By deriving a closed-form solution for the student model's outputs, we show that SD essentially performs label averaging among instances with high feature correlations. This averaging is initially beneficial: it helps the model focus on the feature clusters correlated with a given instance when predicting its label. As the number of distillation rounds increases, however, performance diminishes. We also demonstrate SD's effectiveness under label noise, and identify the label corruption condition and the minimum number of distillation rounds needed to achieve 100% classification accuracy. Finally, we show that in high-noise-rate regimes, one-step distillation with refined teacher outputs outperforms multi-step SD that uses the teacher's direct outputs.
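The label-averaging view of SD can be illustrated with a small toy sketch. This is not the closed-form solution derived in the paper: the similarity matrix (weight 1 within a cluster, `cross_weight` across clusters), the cluster layout, the noisy labels, and the function names are all assumptions chosen only to make the averaging effect visible. Each "round" simply replaces every instance's label vector with a similarity-weighted average of all label vectors.

```python
import numpy as np


def self_distill_label_averaging(labels, clusters, num_classes, rounds, cross_weight=0.1):
    """Toy model of SD as repeated label averaging (illustrative assumption only)."""
    labels = np.asarray(labels)
    clusters = np.asarray(clusters)
    # Hypothetical feature similarity: 1 inside a cluster, cross_weight across clusters.
    same_cluster = clusters[:, None] == clusters[None, :]
    sim = np.where(same_cluster, 1.0, cross_weight)
    # Row-normalize so each row is a set of averaging weights that sum to 1.
    avg = sim / sim.sum(axis=1, keepdims=True)
    y = np.eye(num_classes)[labels]  # round 0: the given (possibly noisy) one-hot labels
    history = [y]
    for _ in range(rounds):
        y = avg @ y  # one round: average labels over similar instances
        history.append(y)
    return history


def accuracy_and_margin(y, true_labels):
    """Accuracy of argmax, and mean (true-class score minus best other-class score)."""
    true_labels = np.asarray(true_labels)
    idx = np.arange(len(true_labels))
    acc = float(np.mean(y.argmax(axis=1) == true_labels))
    true_score = y[idx, true_labels]
    others = y.copy()
    others[idx, true_labels] = -np.inf  # mask out the true class
    margin = float(np.mean(true_score - others.max(axis=1)))
    return acc, margin


# Three clusters of five instances; the cluster index doubles as the true label.
true_labels = np.repeat([0, 1, 2], 5)
# Two of the five labels in each cluster are corrupted (40% label noise).
noisy_labels = np.array([0, 0, 0, 1, 2, 1, 1, 1, 2, 0, 2, 2, 2, 0, 1])

history = self_distill_label_averaging(noisy_labels, true_labels, num_classes=3, rounds=4)
stats = [accuracy_and_margin(y, true_labels) for y in history]
for t, (acc, margin) in enumerate(stats):
    print(f"round {t}: accuracy={acc:.3f}, mean margin={margin:.4f}")
```

In this toy setting, one round of averaging lifts accuracy from 0.6 to 1.0 and the mean margin from 0.2 to 0.3. Further rounds keep shrinking the margin (0.225 after the second round) as the labels drift toward uniform, which loosely mirrors the "initially beneficial, later diminishing" behaviour described above. Accuracy itself does not drop here, so the sketch should not be read as reproducing the paper's results.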