This paper analyzes the Cross-Entropy (CE) loss in knowledge distillation (KD) for recommender systems. KD in recommender systems aims to distill rankings, especially among the items most likely to be preferred, and can only be computed on a small subset of items. Given these characteristics, we reveal the connection between the CE loss and NDCG in the KD setting. We prove that, when performing KD on an item subset, minimizing the CE loss maximizes a lower bound of NDCG only if a closure assumption is satisfied: the item subset must consist of the student's top-ranked items. However, this contradicts our goal of distilling the rankings of the teacher's top-ranked items, and we empirically demonstrate the large gap between these two sets of top items. To bridge the gap between our goal and its theoretical support, we propose Rejuvenated Cross-Entropy for Knowledge Distillation (RCE-KD). It splits the teacher's top items into two subsets according to whether they are also highly ranked by the student. For the subset that violates the closure condition, we devise a sampling strategy that uses teacher-student collaboration to approximately satisfy the closure assumption, and we combine the losses on the two subsets adaptively. Extensive experiments demonstrate the effectiveness of our method. Our code is available at https://github.com/BDML-lab/RCE-KD.
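For concreteness, the subset CE distillation loss and the NDCG metric referenced above can be written in their standard forms. This is an illustrative sketch with assumed notation (scores $z_M$, distributions $p_M$, and ranking $\pi$ are ours, not necessarily the paper's exact formulation):

```latex
% CE distillation loss restricted to an item subset S:
% teacher distribution p_T supervises student distribution p_S,
% both softmax-normalized over S only.
\mathcal{L}_{\mathrm{CE}}(\mathcal{S})
  = -\sum_{i \in \mathcal{S}} p_T(i \mid \mathcal{S})\,
      \log p_S(i \mid \mathcal{S}),
\qquad
p_M(i \mid \mathcal{S})
  = \frac{\exp\bigl(z_M(i)\bigr)}
         {\sum_{j \in \mathcal{S}} \exp\bigl(z_M(j)\bigr)} .

% NDCG@K for a ranking \pi with graded relevance rel:
\mathrm{NDCG}@K = \frac{\mathrm{DCG}@K}{\mathrm{IDCG}@K},
\qquad
\mathrm{DCG}@K
  = \sum_{k=1}^{K} \frac{2^{\,\mathrm{rel}_{\pi(k)}} - 1}{\log_2(k+1)} .
```

The claimed result is that, under the closure assumption (the subset $\mathcal{S}$ consists of the student's own top-ranked items), minimizing $\mathcal{L}_{\mathrm{CE}}(\mathcal{S})$ maximizes a lower bound of NDCG.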