Conventional knowledge distillation, designed for model compression, fails on long-tailed distributions because the teacher model tends to be biased toward head classes and provides limited supervision for tail classes. We propose Long-Tailed Knowledge Distillation (LTKD), a novel framework that reformulates the conventional objective into two components: a cross-group loss, capturing mismatches in prediction distributions across class groups (head, medium, and tail), and a within-group loss, capturing discrepancies within each group's distribution. This decomposition reveals the specific sources of the teacher's bias. To mitigate the inherited bias, LTKD introduces (1) a rebalanced cross-group loss that calibrates the teacher's group-level predictions and (2) a reweighted within-group loss that ensures equal contribution from all groups. Extensive experiments on CIFAR-100-LT, TinyImageNet-LT, and ImageNet-LT demonstrate that LTKD significantly outperforms existing methods in both overall and tail-class accuracy, thereby showing its ability to distill balanced knowledge from a biased teacher for real-world applications.
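The abstract does not spell out the decomposition, but the split into a cross-group and a within-group term is consistent with the standard chain rule for KL divergence over a class partition. A minimal sketch, assuming the teacher and student output distributions $p^{T}, p^{S}$ and a partition of the classes into groups $G \in \{\text{head}, \text{medium}, \text{tail}\}$ (the group-level marginals $\pi_G$ and within-group conditionals $\hat{p}_G$ below are notation introduced here for illustration):

\[
\mathrm{KL}\!\left(p^{T} \,\|\, p^{S}\right)
= \underbrace{\mathrm{KL}\!\left(\pi^{T} \,\|\, \pi^{S}\right)}_{\text{cross-group}}
\;+\;
\underbrace{\sum_{G} \pi^{T}_{G}\, \mathrm{KL}\!\left(\hat{p}^{T}_{G} \,\|\, \hat{p}^{S}_{G}\right)}_{\text{within-group}},
\qquad
\pi_{G} = \sum_{c \in G} p_{c},
\quad
\hat{p}_{G}(c) = \frac{p_{c}}{\pi_{G}} \ \ (c \in G).
\]

Under this reading, the cross-group term exposes the teacher's bias in how probability mass is allocated across head, medium, and tail groups, while the within-group terms are naturally weighted by the teacher's group marginals $\pi^{T}_{G}$, which down-weight tail groups; LTKD's rebalancing of the first term and reweighting of the second plausibly target exactly these two effects, though the precise calibration and weighting schemes are not specified in the abstract.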