Reinforcement learning (RL) has been widely applied in recommendation systems because of its potential to optimize long-term user engagement. From the RL perspective, recommendation can be formulated as a Markov decision process (MDP), in which the recommendation system (agent) interacts with users (environment) and receives feedback (reward signals). However, online interaction is impractical because of concerns about user experience and implementation complexity, so RL recommenders can only be trained on offline datasets that contain limited reward signals and state transitions. As a result, the sparsity of reward signals and state transitions is severe, yet existing RL recommenders have long overlooked it. Worse still, RL methods learn by trial and error, but negative feedback is unavailable in implicit-feedback recommendation tasks, which aggravates the overestimation problem of offline RL recommenders. To address these challenges, we propose a novel RL recommender named model-enhanced contrastive reinforcement learning (MCRL). On the one hand, we learn a value function to estimate long-term user engagement, together with a conservative value learning mechanism to alleviate the overestimation problem. On the other hand, we construct positive and negative state-action pairs and use contrastive learning to model the reward function and the state transition function, exploiting the internal structure of the MDP. Experiments on two real-world datasets demonstrate that the proposed method significantly outperforms existing offline RL and self-supervised RL methods across different representative backbone networks.
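
The two ingredients named above, a conservative value penalty and a contrastive objective over positive and negative state-action pairs, can be illustrated with a minimal sketch. This is not the MCRL implementation: the abstract does not specify either loss, so the sketch assumes a standard InfoNCE contrastive loss and a CQL-style log-sum-exp penalty as stand-ins, and the function names `info_nce_loss` and `conservative_penalty` are hypothetical.

```python
import math


def _log_sum_exp(values):
    # Numerically stable log(sum(exp(v))) via max subtraction.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor embedding (assumed stand-in objective).

    anchor, positive: embedding vectors (lists of floats), e.g. an encoded
    state-action pair and the embedding it should match.
    negatives: list of embedding vectors from constructed negative pairs.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # The positive logit goes first, followed by one logit per negative.
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, neg) / temperature for neg in negatives]
    # Cross-entropy with the positive as the target class.
    return _log_sum_exp(logits) - logits[0]


def conservative_penalty(q_values, logged_action):
    """CQL-style regularizer (assumed stand-in for conservative learning).

    Penalizes large Q-values on all actions relative to the Q-value of the
    action actually observed in the offline log. Always non-negative.
    """
    return _log_sum_exp(q_values) - q_values[logged_action]
```

In this sketch, the contrastive loss is small when the anchor is closer to its positive than to any negative, and the penalty shrinks as the logged action's value dominates the values of unobserved actions, which is one common way to discourage overestimation on actions the offline data never covers.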