Reinforcement learning (RL), which finds a policy that maximizes long-term reward, has been shown to be well suited to optimizing long-term user engagement (LTE) in sequential recommender systems (SRS). However, RL has shortcomings, in particular its need for a large number of online samples for exploration, which is risky in real-world applications. One appealing way to avoid this risk is to build a simulator and learn the optimal recommendation policy in it. In LTE optimization, the simulator models the daily feedback of multiple users to given recommendations. However, building a user simulator with no reality gap, i.e., one that predicts users' feedback exactly, is unrealistic: users' reaction patterns are complex and the historical logs for each user are limited, so simulator errors might mislead a policy trained in the simulator. In this paper, we present a practical simulator-based recommender policy training approach, Simulation-to-Recommendation (Sim2Rec), to handle the reality-gap problem in LTE optimization. Specifically, Sim2Rec introduces a simulator set to generate a variety of possible user behavior patterns, then trains an environment-parameter extractor to recognize users' behavior patterns in the simulators. Finally, a context-aware policy is trained to make optimal decisions for all of these user variants based on the inferred environment parameters. The policy transfers directly to unseen environments (e.g., the real world) because it has learned to recognize a variety of user behavior patterns and to make decisions based on the inferred environment parameters. Experiments are conducted in synthetic environments and on a real-world large-scale ride-hailing platform, DidiChuxing. The results show that Sim2Rec achieves significant performance improvement and produces robust recommendations in unseen environments.
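The data flow between the three components named in the abstract (a simulator set, an environment-parameter extractor, and a context-aware policy) can be illustrated with a toy sketch. This is not the Sim2Rec implementation: the linear user-response model, the `COST` coefficient, and the names `ToySimulator`, `extract_theta`, `context_aware_policy`, and `run_episode` are all hypothetical, and closed-form stand-ins replace the learned extractor and the RL-trained policy that the paper describes.

```python
import numpy as np

COST = 0.5  # hypothetical cost coefficient for a recommendation of strength `action`


class ToySimulator:
    """One hypothetical user-behavior pattern, indexed by a hidden parameter theta."""

    def __init__(self, theta, noise=0.01, seed=0):
        self.theta = theta
        self.noise = noise
        self.rng = np.random.default_rng(seed)

    def feedback(self, action):
        # Toy assumption: user response grows linearly with recommendation strength.
        return self.theta * action + self.noise * self.rng.standard_normal()


def extract_theta(actions, feedbacks):
    """Stand-in for the environment-parameter extractor: least-squares estimate of theta."""
    actions = np.asarray(actions, dtype=float)
    feedbacks = np.asarray(feedbacks, dtype=float)
    return float(actions @ feedbacks / (actions @ actions))


def context_aware_policy(theta_hat):
    """Stand-in for the trained policy: maximizes theta*a - COST*a**2 for the inferred theta."""
    return float(np.clip(theta_hat / (2 * COST), 0.0, 1.0))


def run_episode(sim, steps=20, probe_action=0.5):
    """Interact with one simulator, re-inferring theta from the growing history at each step."""
    actions, feedbacks, rewards = [], [], []
    action = probe_action  # no history yet, so start with a fixed probe action
    for _ in range(steps):
        y = sim.feedback(action)
        rewards.append(y - COST * action**2)
        actions.append(action)
        feedbacks.append(y)
        action = context_aware_policy(extract_theta(actions, feedbacks))
    return extract_theta(actions, feedbacks), float(np.mean(rewards))


# A simulator set covering several behavior patterns, plus one "unseen" pattern.
simulator_set = [ToySimulator(theta, seed=i) for i, theta in enumerate([0.2, 0.5, 0.8])]
unseen = ToySimulator(theta=0.65, seed=42)

for sim in simulator_set + [unseen]:
    theta_hat, avg_reward = run_episode(sim)
    print(f"true theta={sim.theta:.2f}  inferred={theta_hat:.2f}  avg reward={avg_reward:.3f}")
```

In this toy setting the policy never sees the hidden `theta` directly. It acts only on the value inferred from interaction history, which is why the same policy can be applied to the unseen pattern without retraining.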