Off-policy learning is a framework for optimizing policies without deploying them, using data collected by another policy. In recommender systems, this is especially challenging because logged data are imbalanced: some items are recommended, and thus logged, more frequently than others. The imbalance is exacerbated when recommending a list of items, as the action space is combinatorial. To address this challenge, we study pessimistic off-policy optimization for learning to rank. The key idea is to compute lower confidence bounds on the parameters of click models and then return the list with the highest pessimistic estimate of its value. This approach is computationally efficient, and we provide an analysis of it. We study its Bayesian and frequentist variants, and overcome the limitation of an unknown prior by incorporating empirical Bayes. To show the empirical effectiveness of our approach, we compare it to off-policy optimizers that use inverse propensity scores or neglect uncertainty. Our approach outperforms all baselines, is robust, and is general.
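The key idea of ranking by lower confidence bounds can be illustrated with a minimal sketch. It is not the estimator from this work: it assumes a simplified click model in which each item has a single click probability and the best list places items in descending order of that probability, and it uses a Hoeffding-style frequentist bound. The function names, the `delta` parameter, and the logged counts are all hypothetical.

```python
import math


def lower_confidence_bound(clicks, impressions, delta=0.05):
    """Hoeffding-style lower bound on a click probability (illustrative choice)."""
    if impressions == 0:
        return 0.0  # no logged data: fall back to the most pessimistic value
    mean = clicks / impressions
    width = math.sqrt(math.log(1.0 / delta) / (2.0 * impressions))
    return max(0.0, mean - width)


def pessimistic_top_k(stats, k, delta=0.05):
    """Rank items by their lower confidence bound and keep the top k."""
    lcbs = {item: lower_confidence_bound(c, n, delta) for item, (c, n) in stats.items()}
    return sorted(lcbs, key=lcbs.get, reverse=True)[:k]


def empirical_top_k(stats, k):
    """Baseline that neglects uncertainty: rank by the empirical click rate."""
    means = {item: (c / n if n else 0.0) for item, (c, n) in stats.items()}
    return sorted(means, key=means.get, reverse=True)[:k]


# Hypothetical logged data: item -> (clicks, impressions).
# Item "a" was shown once and clicked once; "b" and "c" were logged far more often.
logged = {"a": (1, 1), "b": (450, 1000), "c": (40, 100), "d": (0, 0)}

print(pessimistic_top_k(logged, 2))
print(empirical_top_k(logged, 2))
```

On this toy data the baseline that ignores uncertainty promotes item `a` on the strength of a single click, whereas the pessimistic ranking prefers the frequently logged items `b` and `c`, whose estimates are better supported by the data.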