Off-policy learning, the procedure of optimizing a policy with access only to logged feedback data, is important in many real-world applications, such as search engines and recommender systems. Because the ground-truth logging policy that generated the logged data is usually unknown, previous work simply substitutes its estimated value in off-policy learning. This ignores both the high bias and the high variance that result from such an estimator, especially on samples with small and inaccurately estimated logging probabilities. In this work, we explicitly model the uncertainty in the estimated logging policy and propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning, with a theoretical convergence guarantee. Experimental results on synthetic data and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator against an extensive list of state-of-the-art baselines.
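To make the concern about small, inaccurately estimated logging probabilities concrete, the following is a minimal sketch of the standard (vanilla) inverse propensity score estimate, not of UIPS itself, whose form is not given in this abstract. The function name `ips_value` and all numbers in the toy log are hypothetical and chosen only for illustration.

```python
def ips_value(rewards, target_probs, logging_probs):
    """Vanilla IPS estimate of a target policy's value from logged data.

    Each logged sample is reweighted by target_prob / logging_prob.
    This is the textbook estimator, used here only as an illustration.
    """
    total = 0.0
    for r, pi, beta in zip(rewards, target_probs, logging_probs):
        total += (pi / beta) * r
    return total / len(rewards)


# Hypothetical toy log of four samples (values are made up).
rewards = [1.0, 0.0, 1.0, 1.0]
target_probs = [0.5, 0.2, 0.3, 0.4]

# Assumed ground-truth logging probabilities; the last one is small.
true_logging = [0.5, 0.4, 0.3, 0.02]
# Assumed estimated logging probabilities; only the last differs, by 0.01.
est_logging = [0.5, 0.4, 0.3, 0.01]

v_true = ips_value(rewards, target_probs, true_logging)
v_est = ips_value(rewards, target_probs, est_logging)

print("IPS with true logging probabilities:     ", v_true)
print("IPS with estimated logging probabilities:", v_est)
```

In this toy log, an absolute error of only 0.01 on the one small logging probability doubles that sample's weight and roughly doubles the overall estimate, while the same error on a sample with logging probability 0.5 would barely change it. This is the sensitivity that motivates treating the estimated logging policy as uncertain rather than exact.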