Knowledge distillation has been widely studied as a way to compress recommendation models. In this paper, we decompose a recommendation model into three layers, i.e., the input layer, the intermediate layer, and the output layer, and address the deficiencies of existing methods layer by layer. First, previous methods focus only on the intermediate and output layers, neglecting the input layer. Second, in the intermediate layer, existing methods ignore the inconsistency of user preferences induced by the projectors. Third, in the output layer, existing methods use only hard labels rather than soft labels from the teacher. To address these deficiencies, we propose \textbf{M}ulti-layer \textbf{K}nowledge \textbf{D}istillation (MKD), which consists of three components: 1) Distillation with Neighbor-based Knowledge (NKD) uses the teacher's knowledge about entities with similar characteristics in the input layer so that the student learns robust representations. 2) Distillation with Consistent Preference (CPD) uses two regularization terms to reduce the inconsistency of user preferences caused by the projectors in the intermediate layer. 3) Distillation with Soft Labels (SLD) constructs soft labels in the output layer from the predictions of both the teacher and the student. Extensive experiments show that MKD even outperforms the teacher with only one-tenth of its model size.
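The idea behind SLD, building soft labels from the predictions of both the teacher and the student, can be illustrated with a minimal sketch. The abstract does not specify how the two predictions are combined, so everything below is an assumption made for illustration: the convex mixing weight `alpha`, the sigmoid over logits, the binary cross-entropy loss, and the function names `soft_labels` and `soft_label_loss` are all hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_labels(teacher_logits, student_logits, alpha=0.7):
    """Blend teacher and student predictions into soft labels.

    ASSUMPTION: a convex combination with weight `alpha` on the teacher.
    The paper may combine the two predictions differently.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must score the same items")
    return [
        alpha * sigmoid(t) + (1.0 - alpha) * sigmoid(s)
        for t, s in zip(teacher_logits, student_logits)
    ]


def soft_label_loss(student_logits, labels, eps=1e-12):
    """Mean binary cross-entropy of the student against the soft labels.

    ASSUMPTION: BCE is used here only as a common choice of output loss.
    In a real training loop the labels would be treated as constants,
    i.e., no gradient would flow through the student term inside them.
    """
    total = 0.0
    for s, y in zip(student_logits, labels):
        p = min(max(sigmoid(s), eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(labels)


# Toy scores for three user-item pairs (made-up numbers).
teacher = [2.0, -1.0, 0.5]
student = [0.5, 0.0, -0.5]
labels = soft_labels(teacher, student, alpha=0.7)
loss = soft_label_loss(student, labels)
```

With `alpha = 1.0` this sketch reduces to plain teacher soft labels, and with `alpha = 0.0` the student is trained only on its own predictions, so intermediate values trade off the teacher's signal against the student's current beliefs.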