Model-free RL-based recommender systems have recently received increasing research attention because they can handle partial feedback and long-term rewards. However, most existing research has ignored a critical feature of recommender systems: a user's feedback on the same item at different times is random. This stochastic-reward property differs fundamentally from classic RL scenarios with deterministic rewards and makes RL-based recommendation much more challenging. In this paper, we first demonstrate in a simulator environment that using direct stochastic feedback results in a significant drop in performance. Then, to handle the stochastic feedback more efficiently, we design two stochastic reward stabilization frameworks that replace the direct stochastic feedback with feedback learned by a supervised model. Both frameworks are model-agnostic, i.e., they can effectively utilize various supervised models. We demonstrate the superiority of the proposed frameworks over different RL-based recommendation baselines with extensive experiments on a recommendation simulator as well as an industrial-level recommender system.
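The central idea, replacing raw stochastic feedback with a reward predicted by a supervised model, can be illustrated with a minimal sketch. The abstract does not describe the two frameworks in detail, so the code below is not either of them: it assumes a toy simulator with Bernoulli click feedback and uses a per-pair mean of logged rewards as a stand-in supervised model. All names (`fit_mean_reward_model`, `stabilized_reward`, `true_click_prob`) are hypothetical.

```python
import random
from collections import defaultdict


def fit_mean_reward_model(logs):
    """Fit a stand-in supervised model: the mean logged reward per (user, item)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user, item, reward in logs:
        totals[(user, item)] += reward
        counts[(user, item)] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


def stabilized_reward(model, user, item, observed):
    """Return the model's predicted reward; fall back to raw feedback for unseen pairs."""
    return model.get((user, item), observed)


def variance(values):
    """Population variance of a non-empty list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Hypothetical simulator (an assumption for illustration): the same user clicks
# the same item with a fixed probability, so repeated feedback is random.
rng = random.Random(0)
true_click_prob = {("u1", "i1"): 0.8, ("u1", "i2"): 0.3}
logs = [
    (user, item, 1.0 if rng.random() < prob else 0.0)
    for (user, item), prob in true_click_prob.items()
    for _ in range(2000)
]

model = fit_mean_reward_model(logs)

# Compare the reward signal an RL agent would see for one (user, item) pair.
raw = [r for u, i, r in logs if (u, i) == ("u1", "i1")]
stable = [stabilized_reward(model, u, i, r) for u, i, r in logs if (u, i) == ("u1", "i1")]

print("raw reward variance:", variance(raw))
print("stabilized reward variance:", variance(stable))
```

In this toy setting the raw reward flips between 0 and 1 across repeated interactions, while the stabilized reward is a constant estimate of the click probability. In practice, a model that generalizes across users and items would take the place of the per-pair mean, which is where the model-agnostic property matters.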