Knowledge Distillation (KD) transfers knowledge from a large teacher model to a smaller student by aligning their predictive distributions. However, conventional KD formulations, typically based on the Kullback-Leibler (KL) divergence, assume that the teacher provides reliable soft targets. In practice, teacher predictions are often noisy or overconfident, and existing correction-based approaches rely on ad hoc heuristics and extensive hyper-parameter tuning, which hinders generalization. We introduce REDistill (Robust Estimator Distillation), a simple yet principled framework grounded in robust statistics. REDistill replaces the standard KD objective with a power divergence loss, a generalization of the KL divergence that adaptively downweights unreliable teacher outputs while preserving informative logit relationships. This formulation provides a unified and interpretable treatment of teacher noise, requires only logits, integrates seamlessly into existing KD pipelines, and incurs negligible computational overhead. Extensive experiments on CIFAR-100 and ImageNet-1k demonstrate that REDistill consistently improves student accuracy across diverse teacher-student architectures. Remarkably, it achieves these gains without model-specific hyper-parameter tuning, underscoring its robustness and strong generalization to unseen teacher-student pairs.
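The abstract does not spell out the exact power divergence objective. A minimal sketch of one standard member of this family, the density power divergence (β-divergence) of Basu et al., which recovers the KL divergence as β → 0 and increasingly downweights low-probability teacher mass as β grows, might look like the following. The function names and the choice of β here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def power_divergence(p, q, beta=0.5, eps=1e-12):
    """Density power divergence D_beta(p || q).

    For teacher distribution p and student distribution q:
      D_beta = sum_k [ q_k^(1+beta)
                       - (1 + 1/beta) * p_k * q_k^beta
                       + (1/beta)     * p_k^(1+beta) ]
    As beta -> 0 this reduces to KL(p || q); larger beta makes the
    loss less sensitive to unreliable (low-probability) teacher mass.
    """
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    if beta == 0.0:
        return float(np.sum(p * (np.log(p) - np.log(q))))
    term = (q ** (1.0 + beta)
            - (1.0 + 1.0 / beta) * p * q ** beta
            + (1.0 / beta) * p ** (1.0 + beta))
    return float(np.sum(term))
```

In a distillation loop, `p = softmax(teacher_logits, T)` and `q = softmax(student_logits, T)` would be plugged into `power_divergence` in place of the usual KL term; the divergence is zero when the two distributions coincide and nonnegative otherwise.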