In recommender systems, reinforcement learning (RL) solutions have effectively boosted recommendation performance because they can capture long-term user-system interaction. However, the action of a recommendation policy is a list of items, so the action space can be extremely large and is drawn from a dynamic candidate item pool. To overcome this challenge, we propose a hyper-actor and critic learning framework in which the policy decomposes the item list generation process into a hyper-action inference step and an effect-action selection step. The first step maps the given state space into a vectorized hyper-action space, and the second step selects the item list based on the hyper-action. To regulate the discrepancy between the two action spaces, we design an alignment module along with a kernel mapping function for items to ensure inference accuracy, and we include a supervision module to stabilize the learning process. We build simulated environments on public datasets and empirically show that our framework outperforms standard RL baselines in recommendation performance.
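The two-step decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear state-to-hyper-action map `W`, the linear item kernel `K`, the dot-product scoring, and all function names and dimensions are assumptions chosen for illustration, and the alignment module, supervision module, and critic are omitted.

```python
import numpy as np

def infer_hyper_action(state, W):
    """Step 1: map a state vector to a vectorized hyper-action.
    A plain linear map is assumed here; a real policy would be a learned network."""
    return W @ state

def kernel_map(item_features, K):
    """Map raw item features into the hyper-action space.
    A linear kernel is a hypothetical stand-in for the item kernel mapping function."""
    return item_features @ K.T

def select_effect_action(hyper_action, item_embeddings, k):
    """Step 2: score every candidate against the hyper-action (dot product, assumed)
    and return the indices of the k highest-scoring items plus all scores."""
    scores = item_embeddings @ hyper_action
    top_k = np.argsort(-scores)[:k]
    return top_k, scores

# Toy dimensions and random parameters, for illustration only.
rng = np.random.default_rng(0)
state_dim, feat_dim, hyper_dim, n_items, k = 8, 6, 4, 50, 5
W = rng.normal(size=(hyper_dim, state_dim))
K = rng.normal(size=(hyper_dim, feat_dim))
state = rng.normal(size=state_dim)
items = rng.normal(size=(n_items, feat_dim))

z = infer_hyper_action(state, W)
item_list, scores = select_effect_action(z, kernel_map(items, K), k)

# The candidate pool can change size without touching the hyper-action step.
new_items = rng.normal(size=(80, feat_dim))
new_list, new_scores = select_effect_action(z, kernel_map(new_items, K), k)
```

In this sketch the policy output `z` has a fixed dimension regardless of how many candidates exist, so only the selection step depends on the current item pool. That is one way to read the motivation for separating the hyper-action from the effect-action.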