The goal of unbiased learning to rank (ULTR) is to leverage implicit user feedback for optimizing learning-to-rank systems. Among existing solutions, automatic ULTR algorithms that jointly learn user bias models (i.e., propensity models) with unbiased rankers have received considerable attention because of their superior performance and low deployment cost in practice. Despite their theoretical soundness, their effectiveness is usually demonstrated only under a weak logging policy, where the ranking model can barely rank documents according to their relevance to the query. When the logging policy is strong, e.g., an industry-deployed ranking policy, the reported effectiveness cannot be reproduced. In this paper, we first investigate ULTR from a causal perspective and uncover a negative result: existing ULTR algorithms fail to address the issue of propensity overestimation caused by the query-document relevance confounder. We then propose a new learning objective based on backdoor adjustment and highlight its differences from conventional propensity models, which reveal the prevalence of propensity overestimation. Building on this objective, we introduce a novel propensity model, the Logging-Policy-aware Propensity (LPP) model, together with a distinctive two-step optimization strategy that allows LPP and ranking models to be learned jointly within the automatic ULTR framework and achieves unconfounded propensity estimation for ULTR. Extensive experiments on two benchmarks demonstrate the effectiveness and generalizability of the proposed method.
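The confounding argument can be illustrated with a small numerical sketch. It is not the LPP model or the paper's learning objective; it is a toy position-based click model with assumed values, in which relevance is treated as observed so that the backdoor adjustment can be written out explicitly. The names `theta`, `gamma`, and `naive_and_adjusted` are hypothetical and chosen only for this illustration.

```python
import numpy as np

# Assumed toy values (not taken from the paper).
theta = np.array([1.0, 0.6, 0.3])  # true examination propensity per position
gamma = np.array([0.1, 0.9])       # click probability once examined, for r=0 and r=1


def naive_and_adjusted(p_rel_given_pos):
    """Return (naive, backdoor-adjusted) propensity estimates, normalised by position 1.

    p_rel_given_pos[k] is P(r=1 | position k) under the logging policy.
    Positions are assumed to be equally frequent in the log.
    """
    p_rel_given_pos = np.asarray(p_rel_given_pos, dtype=float)
    # P(r | k) as a (K, 2) matrix: column 0 is r=0, column 1 is r=1.
    p_r_given_k = np.stack([1.0 - p_rel_given_pos, p_rel_given_pos], axis=1)
    # Position-based model: P(click | k, r) = theta_k * gamma_r.
    p_click = theta[:, None] * gamma[None, :]
    # Naive estimate: click-through rate per position, which mixes in P(r | k).
    naive = (p_r_given_k * p_click).sum(axis=1)
    # Backdoor adjustment: average over the marginal P(r) instead of P(r | k).
    p_r = p_r_given_k.mean(axis=0)
    adjusted = (p_click * p_r[None, :]).sum(axis=1)
    return naive / naive[0], adjusted / adjusted[0]


# Weak logging policy: relevance is independent of position.
naive_weak, adjusted_weak = naive_and_adjusted([0.5, 0.5, 0.5])
# Strong logging policy: relevant documents are concentrated at the top.
naive_strong, adjusted_strong = naive_and_adjusted([0.9, 0.5, 0.1])

print("weak   naive:", naive_weak, "adjusted:", adjusted_weak)
print("strong naive:", naive_strong, "adjusted:", adjusted_strong)
```

With these assumed numbers, the naive and adjusted estimates agree under the weak policy. Under the strong policy, the naive relative propensities at positions 2 and 3 fall to roughly 0.37 and 0.07 instead of the true 0.6 and 0.3, because the relevance-driven click gap is attributed to position. In this toy setup the top position's propensity is therefore overestimated relative to the others, while the adjusted estimate recovers the true ratios.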