UOEP: User-Oriented Exploration Policy for Enhancing Long-Term User Experiences in Recommender Systems

Reinforcement learning (RL) has gained traction for enhancing long-term user experiences in recommender systems by effectively exploring users' interests. However, modern recommender systems exhibit distinct user behavioral patterns across tens of millions of items, which increases the difficulty of exploration. For example, users with different activity levels require different intensities of exploration, yet previous studies often overlook this and apply a uniform exploration strategy to all users, which ultimately hurts user experiences in the long run. To address these challenges, we propose User-Oriented Exploration Policy (UOEP), a novel approach that facilitates fine-grained exploration among user groups. We first construct a distributional critic that allows policy optimization under varying quantile levels of the cumulative reward feedback from users, with the quantile levels representing user groups of varying activity levels. Guided by this critic, we devise a population of distinct actors, each aimed at effective and fine-grained exploration within its respective user group. To enhance both diversity and stability during exploration, we further introduce a population-level diversity regularization term and a supervision module. Experimental results on public recommendation datasets demonstrate that our approach outperforms all baselines in terms of long-term performance, validating the effectiveness of its user-oriented exploration. Further analyses show that our approach also improves performance for low-activity users and increases fairness among users.
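To make the idea of optimizing under different quantile levels more concrete, the following is a minimal sketch and not the UOEP implementation. It assumes the distributional critic outputs a fixed number of return quantiles trained with a standard pinball (quantile regression) loss, and that each actor's objective is the mean of the predicted quantiles falling inside a band of quantile levels assigned to it. The function names, the midpoint quantile levels, and the band boundaries are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import numpy as np


def quantile_midpoints(n):
    """Hypothetical choice of levels: tau_i = (2i + 1) / (2n), i = 0..n-1."""
    return (2 * np.arange(n) + 1) / (2 * n)


def pinball_loss(pred_quantiles, target_returns, taus):
    """Standard quantile-regression (pinball) loss.

    pred_quantiles: shape (n,), predicted return quantiles.
    target_returns: shape (m,), sampled cumulative rewards.
    taus:           shape (n,), quantile level of each prediction.
    Returns the loss averaged over all (quantile, sample) pairs.
    """
    # diff[i, j] = target_j - pred_i
    diff = target_returns[None, :] - pred_quantiles[:, None]
    t = taus[:, None]
    # Under-estimates are weighted by tau, over-estimates by (1 - tau).
    loss = np.where(diff >= 0, t * diff, (t - 1.0) * diff)
    return loss.mean()


def actor_objective(pred_quantiles, taus, lo, hi):
    """Mean predicted return over quantile levels in [lo, hi).

    Assumption: one actor is assigned one band [lo, hi), so an actor with a
    low band focuses on the lower tail of the return distribution.
    """
    mask = (taus >= lo) & (taus < hi)
    return pred_quantiles[mask].mean()


# Example: four predicted quantiles split between two hypothetical actors.
taus = quantile_midpoints(4)            # levels 0.125, 0.375, 0.625, 0.875
pred = np.array([1.0, 2.0, 3.0, 4.0])
low_band_value = actor_objective(pred, taus, 0.0, 0.5)
high_band_value = actor_objective(pred, taus, 0.5, 1.0)
```

In this sketch, the actor assigned the lower band optimizes against the lower part of the predicted return distribution, which is one plausible way to tie an actor to users whose cumulative reward feedback is low. The diversity regularization term and the supervision module mentioned in the abstract are not modeled here.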
