Knowledge distillation is a model compression technique in which a compact "student" network is trained to replicate the predictive behavior of a larger "teacher" network. In logit-based knowledge distillation, the de facto approach augments cross-entropy with a distillation term, typically either a KL divergence that matches marginal probabilities or a correlation-based loss that captures intra- and inter-class relationships. In either case, the distillation term enters as an addition to cross-entropy, carrying its own weight that must be carefully tuned. In this paper, we adopt a choice-theoretic perspective and recast knowledge distillation under the Plackett-Luce model by interpreting teacher logits as "worth" scores. We introduce Plackett-Luce Distillation (PLD), a weighted list-wise ranking loss in which the teacher transfers knowledge of its full ranking of classes, weighting each ranked choice by its own confidence. PLD directly optimizes a single "teacher-optimal" ranking: the true label is placed first, followed by the remaining classes in descending teacher confidence. This yields a convex, translation-invariant surrogate that subsumes weighted cross-entropy. Empirically, across CIFAR-100, ImageNet-1K, and MS-COCO, PLD achieves consistent gains across diverse architectures and distillation objectives, including divergence-based, correlation-based, and feature-based methods, in both homogeneous and heterogeneous teacher-student pairs.
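The loss described above can be sketched in a few lines. The following is a minimal, hypothetical illustration (not the authors' reference implementation): it builds the teacher-optimal ranking (true label first, remaining classes by descending teacher confidence), then accumulates the Plackett-Luce negative log-likelihood of that ranking under the student's logits, weighting each sequential choice by the teacher's confidence in the chosen class. The function name `pld_loss` and the temperature parameter are assumptions for the sketch.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    s = sum(exps)
    return [e / s for e in exps]

def pld_loss(student_logits, teacher_logits, true_label, temperature=1.0):
    """Sketch of a weighted Plackett-Luce listwise distillation loss.

    A minimal illustration under stated assumptions, not the paper's
    reference implementation.
    """
    t_probs = softmax([z / temperature for z in teacher_logits])
    s = [z / temperature for z in student_logits]

    # Teacher-optimal ranking: true label first, then the remaining
    # classes in descending order of teacher confidence.
    rest = sorted((c for c in range(len(s)) if c != true_label),
                  key=lambda c: -t_probs[c])
    ranking = [true_label] + rest

    # Plackett-Luce NLL of this ranking under the student's scores:
    # at each step, the log-probability of picking class c from the
    # remaining pool, weighted by the teacher's confidence in c.
    loss = 0.0
    for k, c in enumerate(ranking):
        remaining = ranking[k:]
        m = max(s[j] for j in remaining)
        log_denom = m + math.log(sum(math.exp(s[j] - m) for j in remaining))
        loss -= t_probs[c] * (s[c] - log_denom)
    return loss
```

Because each term depends on logits only through differences against a log-sum-exp over the remaining pool, adding a constant to all student logits leaves the loss unchanged, matching the translation-invariance claimed above.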