Long-term engagement is preferred over immediate engagement in sequential recommendation because it directly affects product operational metrics such as daily active users (DAUs) and dwell time. Reinforcement learning (RL) is widely regarded as a promising framework for optimizing long-term engagement in sequential recommendation. However, because online interactions are expensive, it is very difficult for RL algorithms to perform state-action value estimation, exploration, and feature extraction when optimizing long-term engagement. In this paper, we propose ResAct, which seeks a policy that is close to, but better than, the online-serving policy. In this way, we can collect sufficient data near the learned policy, so that state-action values can be properly estimated and there is no need for online exploration. ResAct optimizes the policy by first reconstructing the online behaviors and then improving them via a Residual Actor. To extract long-term information, ResAct uses two information-theoretic regularizers to ensure the expressiveness and conciseness of features. We conduct experiments on a benchmark dataset and a large-scale industrial dataset consisting of tens of millions of recommendation requests. Experimental results show that our method significantly outperforms state-of-the-art baselines in various long-term engagement optimization tasks.
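The idea of a policy that stays close to the online-serving policy while improving on it can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `residual_action`, the bound `max_residual`, and the action range [-1, 1] are hypothetical choices for illustration, and the reconstructed action and the residual would in practice come from learned models that are omitted here.

```python
import numpy as np


def residual_action(reconstructed_action, residual,
                    max_residual=0.1, low=-1.0, high=1.0):
    """Add a bounded residual to a reconstructed online action.

    All parameter names and default values are assumptions made for
    this sketch; they are not taken from the paper.
    """
    reconstructed_action = np.asarray(reconstructed_action, dtype=float)
    residual = np.asarray(residual, dtype=float)
    # Bound the residual so the result stays near the online behavior.
    bounded = np.clip(residual, -max_residual, max_residual)
    # Keep the final action inside the assumed valid action range.
    return np.clip(reconstructed_action + bounded, low, high)
```

Bounding the residual is one simple way to keep the improved action within the region where logged data is plentiful, which is the property the abstract relies on for value estimation without online exploration.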