Off-policy Learning to Rank (LTR) aims to optimize a ranker from data collected by a deployed logging policy. However, existing off-policy LTR methods often make strong assumptions about how users generate click data (i.e., the click model) and therefore must be tailored to each click model. In this paper, we unify the ranking process under general stochastic click models as a Markov Decision Process (MDP), so that the optimal ranking can be learned directly with offline reinforcement learning (RL). Building upon this, we leverage offline RL techniques for off-policy LTR and propose the Click Model-Agnostic Unified Off-policy Learning to Rank (CUOLR) method, which can be easily applied to a wide range of click models. Through a dedicated formulation of the MDP, we show that offline RL algorithms can adapt to various click models without complex debiasing techniques or prior knowledge of the model. Results on various large-scale datasets demonstrate that CUOLR consistently outperforms state-of-the-art off-policy LTR algorithms while maintaining consistency and robustness under different click models.
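To make the idea of treating ranking as an MDP more concrete, the following is a minimal sketch of how one logged ranking session might be converted into transitions that an offline RL algorithm could consume. The abstract does not spell out the MDP, so everything here is an illustrative assumption rather than the CUOLR formulation: the state is taken to be the query, the current position, and the documents already shown; the action is the document placed at that position; and the reward is the observed click. The names `Transition` and `session_to_transitions` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed state layout: (query, position, documents shown so far).
State = Tuple[str, int, Tuple[str, ...]]


@dataclass(frozen=True)
class Transition:
    state: State
    action: str        # document placed at the current position (assumption)
    reward: float      # observed click, 1.0 or 0.0 (assumption)
    next_state: State
    done: bool         # True at the last position of the list


def session_to_transitions(
    query: str, ranked_docs: List[str], clicks: List[int]
) -> List[Transition]:
    """Turn one logged ranked list with click feedback into MDP transitions."""
    if len(ranked_docs) != len(clicks):
        raise ValueError("ranked_docs and clicks must have the same length")

    transitions: List[Transition] = []
    shown: Tuple[str, ...] = ()
    last = len(ranked_docs) - 1
    for pos, (doc, click) in enumerate(zip(ranked_docs, clicks)):
        state = (query, pos, shown)
        shown = shown + (doc,)          # the chosen document joins the history
        next_state = (query, pos + 1, shown)
        transitions.append(
            Transition(state, doc, float(click), next_state, pos == last)
        )
    return transitions


# Example: a three-document list where only the second document was clicked.
example = session_to_transitions("q1", ["d1", "d2", "d3"], [0, 1, 0])
```

Under these assumptions, no click model appears in the conversion itself: the logged clicks are used as rewards directly, and any position or cascade effects would have to be captured by the value function learned over the states.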