An Enhanced-State Reinforcement Learning Algorithm for Multi-Task Fusion in Large-Scale Recommender Systems

As the last key stage of Recommender Systems (RSs), Multi-Task Fusion (MTF) combines the multiple scores predicted by Multi-Task Learning (MTL) into a final score that maximizes user satisfaction and determines the ultimate recommendation results. In recent years, Reinforcement Learning (RL) has been widely used for MTF in large-scale RSs to maximize long-term user satisfaction within a recommendation session. However, limited by their modeling pattern, all current RL-MTF methods can use only user features as the state when generating an action for each user; they cannot use item features or other valuable features, which leads to suboptimal results. Addressing this problem is challenging because it requires breaking through the current modeling pattern of RL-MTF. To solve it, we propose a novel method called Enhanced-State RL for MTF in RSs. Unlike the existing methods, our method first defines user features, item features, and other valuable features collectively as the enhanced state, and then proposes a novel actor and critic learning process that uses the enhanced state to generate a better action for each user-item pair. To the best of our knowledge, this is the first time this modeling pattern has been proposed in the field of RL-MTF. We conduct extensive offline and online experiments in a large-scale RS. The results show that our model significantly outperforms the other models. Enhanced-State RL has been fully deployed in our RS for more than half a year, improving user valid consumption by 3.84% and user duration time by 0.58% compared to the baseline.
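
To make the contrast between the two modeling patterns concrete, the following is a minimal sketch, not the method itself. It assumes a weighted-sum fusion formula and a toy linear actor with a softmax head (`toy_actor`, `fuse`, and all dimensions are hypothetical; the abstract specifies neither the fusion formula nor the network). With a user-only state, one weight vector is shared by every candidate item; with an enhanced state, each user-item pair gets its own weight vector.

```python
import numpy as np

def fuse(scores, weights):
    """Assumed weighted-sum fusion; weights broadcast against (n_items, n_tasks)."""
    return np.sum(scores * weights, axis=-1)

def toy_actor(state, params):
    """Hypothetical linear actor with a softmax head, standing in for a learned policy."""
    logits = state @ params
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_items, n_tasks, d_user, d_item = 4, 3, 5, 6  # illustrative sizes only

user_feat = rng.normal(size=d_user)                # one user
item_feat = rng.normal(size=(n_items, d_item))     # candidate items
mtl_scores = rng.uniform(size=(n_items, n_tasks))  # e.g. click, like, watch-time scores

# User-only state: a single action (weight vector) shared by all candidates.
w_user = toy_actor(user_feat, rng.normal(size=(d_user, n_tasks)))
final_user_state = fuse(mtl_scores, w_user)

# Enhanced state: user and item features concatenated, one action per user-item pair.
enhanced_state = np.concatenate(
    [np.tile(user_feat, (n_items, 1)), item_feat], axis=1
)
w_pair = toy_actor(enhanced_state, rng.normal(size=(d_user + d_item, n_tasks)))
final_enhanced_state = fuse(mtl_scores, w_pair)

print(w_user.shape, w_pair.shape)
```

In this sketch the only structural difference is the shape of the action: `(n_tasks,)` for the user-only state versus `(n_items, n_tasks)` for the enhanced state. The parameters here are random; in practice they would come from the actor and critic learning process.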
