Calibration-compatible Listwise Distillation of Privileged Features for CTR Prediction

In machine learning systems, privileged features are features that are available during offline training but inaccessible during online serving. Previous studies have recognized their importance and explored ways to address the resulting online-offline discrepancy. A typical practice is privileged features distillation (PFD): train a teacher model on all features (including privileged ones), then distill its knowledge into a student model that excludes the privileged features and is deployed for online serving. In practice, the pointwise cross-entropy loss is often adopted for PFD. However, this loss is insufficient to distill the ranking ability needed for CTR prediction. First, it ignores the non-i.i.d. nature of the data: other items on the same page significantly affect the click probability of the candidate item. Second, it ignores the relative order of items induced by the teacher model's predictions, which is essential for distilling ranking ability. To address these issues, we first extend pointwise PFD to listwise PFD. We then define the calibration-compatible property of a distillation loss and show that commonly used listwise losses do not satisfy this property when employed as distillation losses, thus compromising the model's calibration ability, which is another important measure for CTR prediction. To resolve this dilemma, we propose Calibration-compatible LIstwise Distillation (CLID), which employs a carefully designed listwise distillation loss to achieve better ranking ability than pointwise PFD while preserving the model's calibration ability. We prove theoretically that the CLID loss is calibration-compatible. Extensive experiments on public datasets and on a production dataset collected from the display advertising system of Alibaba further demonstrate the effectiveness of CLID.
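
The tension between ranking and calibration described above can be made concrete with a small sketch. It is an illustration under stated assumptions, not the CLID loss and not the paper's formal definition of calibration-compatibility, neither of which is given in this abstract. It assumes the pointwise distillation loss is a binary cross-entropy against the teacher's click probabilities, and it takes a ListNet-style softmax cross-entropy over one page as a representative "commonly used listwise loss". All function names and the example logits are hypothetical. The point it shows: adding a constant to every student logit on a page leaves the softmax listwise loss unchanged, so that loss cannot pin down the absolute click probabilities, whereas the pointwise loss penalizes the shift.

```python
import math


def sigmoid(x):
    """Map a logit to a click probability."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    """Numerically stable softmax over the items of one page."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def pointwise_distill_loss(student_logits, teacher_logits):
    """Assumed pointwise PFD loss: mean binary cross-entropy of the
    student's click probabilities against the teacher's."""
    loss = 0.0
    for s, t in zip(student_logits, teacher_logits):
        p_t, p_s = sigmoid(t), sigmoid(s)
        loss -= p_t * math.log(p_s) + (1.0 - p_t) * math.log(1.0 - p_s)
    return loss / len(student_logits)


def listwise_softmax_distill_loss(student_logits, teacher_logits):
    """Assumed listwise loss (ListNet-style): cross-entropy between the
    teacher's and the student's softmax distributions over one page."""
    q_t = softmax(teacher_logits)
    q_s = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(q_t, q_s))


# Hypothetical logits for three items shown on the same page.
teacher = [1.2, -0.3, -2.0]
student = [1.2, -0.3, -2.0]            # matches the teacher exactly
shifted = [s + 3.0 for s in student]   # same ranking, inflated probabilities

print("listwise  (matched):", listwise_softmax_distill_loss(student, teacher))
print("listwise  (shifted):", listwise_softmax_distill_loss(shifted, teacher))
print("pointwise (matched):", pointwise_distill_loss(student, teacher))
print("pointwise (shifted):", pointwise_distill_loss(shifted, teacher))
```

In this sketch the two listwise values agree up to floating-point error, while the pointwise value is larger for the shifted student. A student trained only with such a listwise loss could therefore rank items like the teacher yet predict systematically inflated click probabilities, which is the kind of calibration damage the abstract attributes to commonly used listwise losses.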
