In recommender systems, reinforcement learning has shown promising results in optimizing the sequence of interactions between users and the system for long-term performance. For practical reasons, the policy's action is typically designed as recommending a list of items, which handles users' frequent and continuous browsing requests more efficiently. In this list-wise recommendation scenario, the corresponding MDP formulation updates the user state upon every request. However, this request-level formulation is inherently inconsistent with the user's item-level behavior. In this study, we demonstrate that an item-level optimization approach can better exploit item characteristics and improve the policy's performance even under the request-level MDP. We support this claim by comparing standard request-level methods with the proposed item-level actor-critic framework in both simulation and online experiments. Furthermore, we find that naively decomposing the future value equally across items may not effectively express each item's long-term utility. To address this issue, we propose a future decomposition strategy based on each item's immediate reward, and further show that more refined weight settings can be obtained through adversarial learning.
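The contrast between equal and reward-based decomposition of the future value can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so everything below is an assumption made for illustration: the function names are hypothetical, rewards are assumed non-negative, the reward-based weights are taken to be simply proportional to each item's immediate reward, and the item-level target is assumed to be the item's reward plus its weighted share of the discounted request-level future value. The adversarially learned weights mentioned above are not covered.

```python
from typing import List


def equal_weights(num_items: int) -> List[float]:
    """Naive decomposition: every item gets the same share of the future value."""
    return [1.0 / num_items] * num_items


def reward_based_weights(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Assumed scheme: shares proportional to each item's immediate reward.

    Falls back to equal shares when the list received (almost) no reward.
    Rewards are assumed to be non-negative.
    """
    total = sum(rewards)
    if total <= eps:
        return equal_weights(len(rewards))
    return [r / total for r in rewards]


def item_level_targets(
    rewards: List[float],
    next_value: float,
    gamma: float,
    weights: List[float],
) -> List[float]:
    """Item-wise target: immediate reward plus a weighted share of the
    discounted request-level future value (hypothetical formulation)."""
    return [r + gamma * w * next_value for r, w in zip(rewards, weights)]


# One request with three recommended items; the second item got no feedback.
rewards = [1.0, 0.0, 3.0]
next_value = 8.0  # critic's estimate for the next request-level state (made up)
gamma = 0.5

equal_targets = item_level_targets(
    rewards, next_value, gamma, equal_weights(len(rewards))
)
reward_targets = item_level_targets(
    rewards, next_value, gamma, reward_based_weights(rewards)
)
```

Under both schemes the item-level targets sum to the request-level target (total immediate reward plus discounted future value), but the reward-based variant assigns no future credit to the item that received no feedback, whereas the equal split credits all three items identically.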