Recent advances in recommender systems have been remarkably successful at optimizing immediate engagement. However, long-term user engagement, a more desirable performance metric, remains difficult to improve. Meanwhile, recent reinforcement learning (RL) algorithms have proven effective in a variety of long-term goal optimization tasks. For this reason, RL is widely considered a promising framework for optimizing long-term user engagement in recommendation. The application of RL, however, relies heavily on well-designed rewards, and designing rewards that reflect long-term user engagement is difficult. To mitigate this problem, we propose a novel paradigm, recommender systems with human preferences (or Preference-based Recommender systems, PrefRec), which allows RL recommender systems to learn from preferences about users' historical behaviors rather than from explicitly defined rewards. Such preferences are easily accessible through techniques such as crowdsourcing, as they do not require any expert knowledge. With PrefRec, we can fully exploit the advantages of RL in optimizing long-term goals while avoiding complex reward engineering. PrefRec uses the preferences to automatically train a reward function in an end-to-end manner. The reward function is then used to generate learning signals for training the recommendation policy. Furthermore, we design an effective optimization method for PrefRec, which uses an additional value function, expectile regression, and reward model pre-training to improve performance. We conduct experiments on a variety of long-term user engagement optimization tasks. The results show that PrefRec significantly outperforms previous state-of-the-art methods on all tasks.
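The abstract names two learning signals: a reward function trained from preferences, and expectile regression. The minimal sketch below illustrates both. It assumes a Bradley-Terry preference model over summed per-step rewards, a common choice in preference-based RL that the abstract does not confirm for PrefRec. The names `reward_fn`, `preference_loss`, and `expectile_loss`, the list-of-steps trajectory representation, and the default `tau=0.7` are hypothetical and chosen only for illustration.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def trajectory_return(reward_fn, trajectory):
    """Sum of predicted per-step rewards over one trajectory."""
    return sum(reward_fn(step) for step in trajectory)


def preference_loss(reward_fn, traj_a, traj_b, label):
    """Cross-entropy under an assumed Bradley-Terry preference model.

    label is 1.0 if traj_a is preferred, 0.0 if traj_b is preferred,
    and 0.5 if the annotator judged them equal.
    """
    diff = trajectory_return(reward_fn, traj_a) - trajectory_return(reward_fn, traj_b)
    log_p_a = -softplus(-diff)  # log sigmoid(diff)
    log_p_b = -softplus(diff)   # log sigmoid(-diff)
    return -(label * log_p_a + (1.0 - label) * log_p_b)


def expectile_loss(prediction, target, tau=0.7):
    """Asymmetric squared error used in expectile regression.

    Errors where the target exceeds the prediction are weighted by tau,
    the remaining errors by 1 - tau; tau = 0.5 is a scaled squared error.
    """
    diff = target - prediction
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


# Hypothetical toy setup: each step is a single engagement feature,
# and the reward model simply returns that feature.
toy_reward = lambda step: step
loss = preference_loss(toy_reward, [1.0, 2.0], [0.5], label=1.0)
```

Minimizing `preference_loss` over many labelled trajectory pairs pushes the reward model to assign higher total reward to the preferred behavior, and the learned rewards can then serve as the training signal for a recommendation policy. With `tau` above 0.5, `expectile_loss` penalizes underestimates more than overestimates, so a value estimate fitted with it leans toward the upper range of the observed targets.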