Training reinforcement learning (RL)-based recommenders from historical user-item interaction sequences is vital for generating high-reward recommendations and improving long-term cumulative benefits. However, existing RL recommendation methods struggle (i) to estimate value functions for states that are not contained in the offline training data, and (ii) to learn effective state representations from implicit user feedback, owing to the lack of contrastive signals. In this work, we propose contrastive state augmentations (CSA) for training RL-based recommender systems. To tackle the first issue, we propose four state augmentation strategies that enlarge the state space of the offline data. This improves the generalization capability of the recommender by making the RL agent visit local state regions and by ensuring that the learned value functions are similar between the original and augmented states. For the second issue, we introduce contrastive signals between augmented states and states randomly sampled from other sessions to further improve state representation learning. To verify the effectiveness of CSA, we conduct extensive experiments on two publicly accessible datasets and one dataset collected from a real-life e-commerce platform. We also conduct experiments in a simulated environment as an online evaluation setting. Experimental results demonstrate that CSA effectively improves recommendation performance.
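As a rough illustration of the two ideas summarized above, the following minimal sketch combines a value-consistency term over augmented states with a contrastive term that uses a state from another session as the negative. Everything in it is an assumption made for illustration: the abstract does not name the four augmentation strategies, so `add_noise` and `mask_dims` are hypothetical stand-ins, the linear `q_value` replaces a learned value network, and the InfoNCE-style `contrastive_loss` is one common way to express contrastive signals, not necessarily the formulation used in CSA.

```python
import math
import random


def add_noise(state, scale, rng):
    """Hypothetical augmentation: perturb each dimension with Gaussian noise."""
    return [x + rng.gauss(0.0, scale) for x in state]


def mask_dims(state, drop_prob, rng):
    """Hypothetical augmentation: zero out randomly chosen dimensions."""
    return [0.0 if rng.random() < drop_prob else x for x in state]


def q_value(state, weights):
    """Toy linear value function standing in for a learned Q-network."""
    return sum(w * x for w, x in zip(weights, state))


def consistency_loss(state, augmented, weights):
    """Mean squared gap between the value of a state and of its augmented copies."""
    q = q_value(state, weights)
    return sum((q_value(a, weights) - q) ** 2 for a in augmented) / len(augmented)


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def contrastive_loss(anchor, positive, negative, temperature=0.5):
    """InfoNCE-style loss with a single positive and a single negative."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = math.exp(cosine(anchor, negative) / temperature)
    return -math.log(pos / (pos + neg))


rng = random.Random(0)
state = [0.8, -0.2, 0.5, 0.1]                 # toy state embedding
other_session_state = [-0.6, 0.9, -0.3, 0.4]  # assumed negative sample
weights = [0.5, 0.1, -0.3, 0.2]               # toy value-function parameters

# Two augmented views of the same state act as a positive pair.
views = [add_noise(state, 0.05, rng), mask_dims(state, 0.25, rng)]
value_term = consistency_loss(state, views, weights)
contrast_term = contrastive_loss(views[0], views[1], other_session_state)
total = value_term + contrast_term
```

In this sketch the consistency term pulls the values of nearby (augmented) states toward the value of the original state, while the contrastive term pushes the representation of a state away from one drawn from a different session. How the two terms are weighted against the RL objective is left open here.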