As the last critical stage of recommender systems (RSs), Multi-Task Fusion (MTF) combines the multiple scores output by Multi-Task Learning (MTL) into a final score that maximizes user satisfaction and determines the ultimate recommendation results. Recently, to optimize long-term user satisfaction within a recommendation session, Reinforcement Learning (RL) has been applied to MTF in industry. However, the off-policy RL algorithms used for MTF so far suffer from the following severe problems: 1) to avoid the out-of-distribution (OOD) problem, their constraints are overly strict, which seriously damages their performance; 2) they are unaware of the exploration policy used to produce the training data and never interact with the real environment, so only a suboptimal policy can be learned; 3) traditional exploration policies are inefficient and hurt user experience. To solve these problems, we propose a novel method named IntegratedRL-MTF, customized for MTF in large-scale RSs. IntegratedRL-MTF integrates an off-policy RL model with our online exploration policy to relax overly strict and complicated constraints, which significantly improves its performance. We also design an extremely efficient exploration policy that eliminates low-value exploration space and focuses on exploring potentially high-value state-action pairs. Moreover, we adopt a progressive training mode to further enhance our model's performance with the help of our exploration policy. We conduct extensive offline and online experiments in the short video channel of Tencent News. The results demonstrate that our model outperforms other models remarkably. IntegratedRL-MTF has been fully deployed in our RS and other large-scale RSs in Tencent, where it has achieved significant improvements.
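To make the MTF setup concrete, the sketch below illustrates the general idea of fusing per-task MTL scores into a single ranking score, where the fusion weights play the role of the action chosen by an RL policy. The function name, the weighted-sum fusion form, and all numbers are illustrative assumptions, not the paper's exact formulation.

```python
def fuse_scores(task_scores, weights):
    """Weighted-sum fusion: combine each item's per-task MTL scores
    into one final score using the RL-chosen fusion weights.

    task_scores: list of [score_task1, score_task2, ...] per candidate item
    weights: fusion weight per task (the RL policy's action; illustrative)
    """
    return [sum(s * w for s, w in zip(item, weights)) for item in task_scores]

# Example: 3 candidate items, 2 MTL heads (e.g. click and watch-time scores).
scores = [[0.9, 0.2],
          [0.5, 0.8],
          [0.3, 0.4]]
weights = [0.7, 0.3]  # hypothetical action emitted by the RL policy

final = fuse_scores(scores, weights)
# Rank candidates by fused score, highest first.
ranking = sorted(range(len(final)), key=lambda i: -final[i])
```

In an RL-for-MTF setting, the policy would output `weights` conditioned on the user/session state, and the reward would reflect long-term user satisfaction over the session rather than a single-impression objective.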