Sequential recommendation, in which user preferences are dynamically inferred from historical behavior sequences, is a critical task in recommender systems (RSs). To further optimize long-term user engagement, offline reinforcement-learning-based RSs have become a mainstream technique, as they offer the additional advantage of avoiding global exploration that may harm the online user experience. However, previous studies mainly focus on discrete action and policy spaces, which struggle to handle rapidly growing item sets efficiently. To mitigate this issue, we design an algorithmic framework applicable to continuous policies. To facilitate control in the low-dimensional but dense user preference space, we propose an \underline{\textbf{E}}fficient \underline{\textbf{Co}}ntinuous \underline{\textbf{C}}ontrol framework (ECoC). Based on a statistically tested assumption, we first propose a novel unified action representation abstracted from normalized user and item spaces. We then develop the corresponding policy evaluation and policy improvement procedures, in which strategic exploration and directional control over unified actions are carefully designed and prove crucial to the final recommendation decisions. Moreover, benefiting from unified actions, conservatism regularization of both policies and value functions can be combined and is fully compatible with the continuous framework. The resulting dual regularization ensures successful offline training of RL-based recommendation policies. Finally, we conduct extensive experiments to validate the effectiveness of our framework. The results show that, compared with discrete baselines, ECoC trains far more efficiently; meanwhile, the final policies outperform the baselines in both capturing the offline data and gaining long-term rewards.