Off-policy Learning to Rank (LTR) aims to optimize a ranker from data collected by a deployed logging policy. However, existing off-policy LTR methods often make strong assumptions about how users generate click data, i.e., the click model, and hence must be tailored to each click model. In this paper, we unify the ranking process under general stochastic click models as a Markov Decision Process (MDP), so that the optimal ranking can be learned directly with offline reinforcement learning (RL). Building on this, we leverage offline RL techniques for off-policy LTR and propose the Click Model-Agnostic Unified Off-policy Learning to Rank (CUOLR) method, which can be easily applied to a wide range of click models. Through a dedicated formulation of the MDP, we show that offline RL algorithms can adapt to various click models without complex debiasing techniques or prior knowledge of the click model. Results on various large-scale datasets demonstrate that CUOLR consistently outperforms state-of-the-art off-policy LTR algorithms while maintaining consistency and robustness under different click models.