Off-policy Learning to Rank (LTR) aims to optimize a ranker from data collected by a deployed logging policy. However, existing off-policy LTR methods often make strong assumptions about how users generate click data, i.e., the click model, and hence must be tailored specifically to each click model. In this paper, we unify the ranking process under general stochastic click models as a Markov Decision Process (MDP), so that the optimal ranking can be learned directly with offline reinforcement learning (RL). Building upon this, we leverage offline RL techniques for off-policy LTR and propose the Click Model-Agnostic Unified Off-policy Learning to Rank (CUOLR) method, which can be easily applied to a wide range of click models. Through a dedicated formulation of the MDP, we show that offline RL algorithms can adapt to various click models without complex debiasing techniques or prior knowledge of the click model. Results on various large-scale datasets demonstrate that CUOLR consistently outperforms state-of-the-art off-policy LTR algorithms while maintaining consistency and robustness under different click models.
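To make the idea of casting ranking as an MDP more concrete, the following is a minimal sketch of how one logged ranked list with click feedback could be turned into offline RL transitions. It is an illustrative assumption, not the formulation used in the paper: here the state is taken to be the query, the current position, and the documents already shown; the action is the document placed at that position; and the reward is the observed click. The names `State`, `Transition`, and `session_to_transitions` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class State:
    # Assumed state: query, position about to be filled, documents shown above it.
    query_id: str
    position: int
    shown: Tuple[str, ...]


@dataclass(frozen=True)
class Transition:
    state: State
    action: str                  # document id placed at state.position
    reward: float                # 1.0 if the logged user clicked, else 0.0
    next_state: Optional[State]  # None once the list is exhausted


def session_to_transitions(
    query_id: str, ranking: Sequence[str], clicks: Sequence[int]
) -> List[Transition]:
    """Convert one logged ranking and its clicks into MDP transitions."""
    if len(ranking) != len(clicks):
        raise ValueError("ranking and clicks must have the same length")

    transitions: List[Transition] = []
    for k, (doc, click) in enumerate(zip(ranking, clicks)):
        state = State(query_id, k, tuple(ranking[:k]))
        is_last = k == len(ranking) - 1
        next_state = (
            None if is_last else State(query_id, k + 1, tuple(ranking[: k + 1]))
        )
        transitions.append(Transition(state, doc, float(click), next_state))
    return transitions


if __name__ == "__main__":
    # Hypothetical logged session: three documents, a click on the second one.
    for t in session_to_transitions("q1", ["d3", "d1", "d2"], [0, 1, 0]):
        print(t)
```

Transitions in this form could be collected into a replay buffer and passed to a generic offline RL algorithm. Because the click model only affects how rewards depend on the state, nothing in the data conversion itself is specific to a particular click model.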