Knowledge Distillation (KD) has been widely used to improve the quality of latency-sensitive models serving live traffic. However, applying KD in production recommender systems with low traffic is challenging: the limited amount of data restricts the teacher model size, and the cost of training a large dedicated teacher may not be justified. Cross-domain KD offers a cost-effective alternative by leveraging a teacher from a data-rich source domain, but it introduces unique technical difficulties, as the features, user interfaces, and prediction tasks can differ significantly between domains. We present a case study of zero-shot cross-domain KD for multi-task ranking models, transferring knowledge from a 100x larger-scale video recommendation platform (YouTube) to a music recommendation application with significantly lower traffic. We share offline and live experiment results and present findings from evaluating different KD techniques in this setting across two ranking models on the music app. Our results demonstrate that zero-shot cross-domain KD is a practical and effective approach for improving the performance of ranking models on low-traffic surfaces.
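To make the setup concrete, the following is a minimal sketch of how a distillation term could be added to a multi-task ranking loss. It is an illustration under stated assumptions, not the method evaluated in the case study: the abstract does not specify the loss, so the binary cross-entropy form, the `kd_weight` parameter, and the task names (`click`, `long_listen`) are all hypothetical. The zero-shot aspect is represented only by the teacher's predictions being taken as given (the source-domain teacher scores target-domain examples without fine-tuning) and by the teacher covering only a subset of the student's tasks.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bce(target: float, logit: float, eps: float = 1e-7) -> float:
    # Binary cross-entropy between a (possibly soft) target in [0, 1]
    # and the probability implied by the student's logit.
    p = min(max(sigmoid(logit), eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def kd_loss(student_logits, hard_labels, teacher_probs, kd_weight=0.5):
    """Per-example multi-task loss with an optional distillation term.

    student_logits: task name -> student logit
    hard_labels:    task name -> observed label (0.0 or 1.0)
    teacher_probs:  task name -> teacher soft label; may cover only a
                    subset of tasks, since source-domain and
                    target-domain tasks can differ (assumption).
    kd_weight:      hypothetical weight on the distillation term.
    """
    total = 0.0
    for task, logit in student_logits.items():
        # Supervised term on the target domain's own labels.
        loss = bce(hard_labels[task], logit)
        # Distillation term only where a teacher prediction exists.
        if task in teacher_probs:
            loss += kd_weight * bce(teacher_probs[task], logit)
        total += loss
    return total


# Hypothetical example: the teacher provides a soft label for "click" only.
example_loss = kd_loss(
    student_logits={"click": 0.3, "long_listen": -1.2},
    hard_labels={"click": 1.0, "long_listen": 0.0},
    teacher_probs={"click": 0.8},
)
```

In this sketch, tasks without a teacher counterpart fall back to the plain supervised loss, which is one simple way to handle the task mismatch between domains that the abstract mentions.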