In recommender systems, reinforcement learning (RL) solutions have effectively boosted recommendation performance because of their ability to capture long-term user-system interactions. However, each action of the recommendation policy is a list of items, so the action space can be extremely large, especially when the candidate item pool is dynamic. To overcome this challenge, we propose a hyper-actor and critic learning framework in which the policy decomposes item list generation into a hyper-action inference step and an effect-action selection step. The first step maps the given state space into a vectorized hyper-action space, and the second step selects the item list based on the hyper-action. To regulate the discrepancy between the two action spaces, we design an alignment module along with a kernel mapping function for items to ensure inference accuracy, and we include a supervision module to stabilize the learning process. We build simulated environments on public datasets and empirically show that our framework outperforms standard RL baselines in recommendation performance.
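The two-step decomposition described above can be illustrated with a minimal sketch. It is not the framework's actual implementation: the linear actor, the linear item kernel, dot-product scoring, top-k selection, and all names and dimensions (`infer_hyper_action`, `select_effect_action`, `HYPER_DIM`, and so on) are assumptions made only for illustration, and the critic, alignment module, and supervision module are omitted.

```python
import random


def linear(weights, vector):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def infer_hyper_action(state, actor_weights):
    """Step 1: map a user state to a vectorized hyper-action.

    A single linear layer stands in for the actor here (an assumption).
    """
    return linear(actor_weights, state)


def select_effect_action(hyper_action, item_features, kernel_weights, k):
    """Step 2: pick the item list (effect-action) from the hyper-action.

    Each candidate item is mapped into the hyper-action space by a
    hypothetical linear kernel, scored by dot product with the
    hyper-action, and the k highest-scoring item ids are returned.
    """
    scores = {}
    for item_id, features in item_features.items():
        item_kernel = linear(kernel_weights, features)
        scores[item_id] = sum(h * z for h, z in zip(hyper_action, item_kernel))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of values drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Toy setup with made-up dimensions and randomly initialized weights.
rng = random.Random(0)
STATE_DIM, ITEM_DIM, HYPER_DIM = 4, 3, 2
actor_weights = random_matrix(HYPER_DIM, STATE_DIM, rng)
kernel_weights = random_matrix(HYPER_DIM, ITEM_DIM, rng)

# The candidate pool is a plain dict, so items can be added or removed
# between calls without changing the actor's output dimension.
item_features = {
    f"item_{i}": [rng.uniform(-1, 1) for _ in range(ITEM_DIM)] for i in range(10)
}
state = [rng.uniform(-1, 1) for _ in range(STATE_DIM)]

hyper_action = infer_hyper_action(state, actor_weights)
recommended = select_effect_action(hyper_action, item_features, kernel_weights, k=3)
print(recommended)
```

In this sketch the actor's output has a fixed size (`HYPER_DIM`) regardless of how many candidate items exist, which is one way to see why inferring a hyper-action first can sidestep a very large or changing item-list action space.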