Counterfactual learning to rank (CLTR) relies on exposure-based inverse propensity scoring (IPS), an LTR-specific adaptation of IPS that corrects for position bias. While IPS can provide unbiased and consistent estimates, it often suffers from high variance. Especially when little click data is available, this variance can cause CLTR to learn sub-optimal ranking behavior. Consequently, existing CLTR methods carry significant risks, as naively deploying their models can result in very negative user experiences. We introduce a novel risk-aware CLTR method with theoretical guarantees for safe deployment. We apply a novel exposure-based concept of risk regularization to IPS estimation for LTR. Our risk regularization penalizes the mismatch between the ranking behavior of a learned model and that of a given safe model. It thereby ensures that learned ranking models stay close to a trusted model when there is high uncertainty in IPS estimation, which greatly reduces the risks during deployment. Our experimental results demonstrate the efficacy of the proposed method: it avoids initial periods of bad performance when little data is available, while maintaining high performance at convergence. For the CLTR field, our exposure-based risk minimization method enables practitioners to adopt CLTR methods in a safer manner that mitigates many of the risks attached to previous methods.
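To make the idea in the abstract concrete, the following is a minimal sketch of an exposure-based IPS estimate combined with a risk penalty that shrinks as more data is logged. It is an illustration under stated assumptions, not the estimator or bound from the paper: the examination model `(1/rank)^eta`, the chi-square-style divergence in `exposure_risk`, the `1/sqrt(N)` scaling, the `strength` parameter, and all function names are hypothetical choices, and the logging policy is assumed to be the safe policy.

```python
import numpy as np


def position_propensities(n_positions, eta=1.0):
    """Assumed examination model: P(examined at rank k) = (1/k)^eta."""
    ranks = np.arange(1, n_positions + 1)
    return (1.0 / ranks) ** eta


def exposure(ranking, propensities):
    """Exposure each document receives under a deterministic ranking.

    ranking[k] is the id of the document placed at (0-indexed) rank k.
    """
    expo = np.zeros(len(ranking))
    expo[ranking] = propensities[: len(ranking)]
    return expo


def ips_utility(new_expo, log_expo, click_counts, n_impressions):
    """Exposure-based IPS: clicks reweighted by new/logging exposure ratio."""
    return np.sum(new_expo / log_expo * click_counts) / n_impressions


def exposure_risk(new_expo, safe_expo):
    """Assumed mismatch measure between two rankings: a chi-square-style
    divergence of their normalized exposure distributions (0 iff equal)."""
    p = new_expo / new_expo.sum()
    q = safe_expo / safe_expo.sum()
    return np.sum((p - q) ** 2 / q)


def risk_aware_objective(new_expo, safe_expo, click_counts, n_impressions,
                         strength=1.0):
    """IPS utility minus a penalty that is large when data is scarce.

    Assumes the logged clicks were collected under the safe policy, so
    safe_expo doubles as the logging exposure in the IPS weights.
    """
    penalty = strength * np.sqrt(
        exposure_risk(new_expo, safe_expo) / n_impressions
    )
    return ips_utility(new_expo, safe_expo, click_counts, n_impressions) - penalty
```

In this sketch, a candidate ranking identical to the safe one incurs no penalty, while a ranking that moves exposure away from the safe model is penalized heavily at small `n_impressions` and progressively less as clicks accumulate, mirroring the behavior the abstract describes.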