Probabilistic learning to rank (LTR) has been the dominant approach for optimizing ranking metrics, but it cannot maximize long-term rewards. Reinforcement learning models have been proposed to maximize users' long-term rewards by formulating recommendation as a sequential decision-making problem, but they achieve lower accuracy than their LTR counterparts, primarily because of the lack of online interactions and the characteristics of ranking. In this paper, we propose a new off-policy value ranking (VR) algorithm that simultaneously maximizes users' long-term rewards and optimizes the ranking metric offline, for improved sample efficiency, in a unified Expectation-Maximization (EM) framework. We show theoretically and empirically that the EM process guides the learned policy to benefit from integrating the future reward with the ranking metric, and to learn without any online interactions. Extensive offline and online experiments demonstrate the effectiveness of our methods.