In recommender systems, reinforcement learning has shown promising results in optimizing the sequence of interactions between users and the system for long-term performance. For practical reasons, the policy's action is typically designed as recommending a list of items, so that users' frequent and continuous browsing requests can be handled more efficiently. In this list-wise recommendation scenario, the corresponding MDP formulation updates the user state upon every request. However, this request-level formulation is inconsistent with the user's item-level behavior. In this study, we demonstrate that an item-level optimization approach can better utilize item characteristics and improve the policy's performance even under the request-level MDP. We support this claim by comparing standard request-level methods with the proposed item-level actor-critic framework in both simulation and online experiments. Furthermore, we show that a reward-based future decomposition strategy can better express the item-wise future impact and improve long-term recommendation accuracy. To gain a more thorough understanding of the decomposition strategy, we propose a model-based re-weighting framework with adversarial learning that further boosts performance, and we investigate its correlation with the reward-based strategy.
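To make the idea of a reward-based future decomposition more concrete, the following is a minimal sketch and not the paper's actual formulation. It assumes a request-level TD target of the form (sum of item rewards) + gamma * V(next state), and it assumes that each item receives a share of the discounted future value in proportion to its own immediate reward. The function name `decompose_future_value`, the uniform fallback, and the non-negativity of rewards are all illustrative assumptions.

```python
def decompose_future_value(item_rewards, next_state_value, gamma=0.9, eps=1e-8):
    """Split a request-level TD target into per-item targets.

    Hypothetical rule (an assumption, not taken from the paper): each item
    keeps its own immediate reward and receives a share of the discounted
    future value proportional to that reward. Rewards are assumed to be
    non-negative (e.g. click or watch-time signals).
    """
    n = len(item_rewards)
    if n == 0:
        return []

    total = sum(item_rewards)
    if total <= eps:
        # No item received feedback: fall back to a uniform split (assumption).
        weights = [1.0 / n] * n
    else:
        weights = [r / total for r in item_rewards]

    # Per-item target: own reward plus a weighted share of the future value.
    return [r + gamma * w * next_state_value
            for r, w in zip(item_rewards, weights)]


# Example: a list of three recommended items, where only two got feedback.
targets = decompose_future_value([1.0, 0.0, 3.0], next_state_value=2.0, gamma=0.5)
```

Under this sketch the per-item targets sum to the request-level target, so the item-level view redistributes the same total signal rather than changing it. An item with no immediate reward receives no share of the future value.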