An Off-Policy Reinforcement Learning Algorithm Customized for Multi-Task Fusion in Large-Scale Recommender Systems

As the last critical stage of recommender systems (RSs), Multi-Task Fusion (MTF) combines the multiple scores output by Multi-Task Learning (MTL) into a final score to maximize user satisfaction, and this final score determines the ultimate recommendation results. Recently, Reinforcement Learning (RL) has been used for MTF in industry to optimize long-term user satisfaction within a recommendation session. However, the off-policy RL algorithms used for MTF so far have the following severe problems: 1) to avoid the out-of-distribution (OOD) problem, their constraints are overly strict, which seriously damages their performance; 2) they are unaware of the exploration policy used to produce the training data and never interact with the real environment, so only a suboptimal policy can be learned; 3) traditional exploration policies are inefficient and hurt user experience. To solve these problems, we propose a novel method named IntegratedRL-MTF, customized for MTF in large-scale RSs. IntegratedRL-MTF integrates an off-policy RL model with our online exploration policy to relax overly strict and complicated constraints, which significantly improves its performance. We also design an extremely efficient exploration policy that eliminates the low-value exploration space and focuses on exploring potentially high-value state-action pairs. Moreover, we adopt a progressive training mode to further enhance our model's performance with the help of our exploration policy. We conduct extensive offline and online experiments in the short video channel of Tencent News. The results demonstrate that our model substantially outperforms other models. IntegratedRL-MTF has been fully deployed in our RS and in other large-scale RSs at Tencent, where it has achieved significant improvements.
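To make the MTF setting concrete, the sketch below shows one common way to fuse per-task scores into a final ranking score, together with an exploration step that samples only near the current policy's action. The abstract does not give the fusion formula or the exploration rule, so the weighted-product form, the box-shaped exploration region, and every name here (`fuse_scores`, `explore_near`, `rank`, `radius`) are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def fuse_scores(scores, weights):
    """Combine per-task scores into one final score.

    Assumption: a weighted product, final = prod(score_i ** weight_i).
    The weights play the role of the RL action; scores must be positive.
    """
    return math.prod(s ** w for s, w in zip(scores, weights))


def explore_near(policy_action, radius, rng):
    """Sample fusion weights inside a small box around the policy's action.

    Assumption: restricting exploration to a neighbourhood of the current
    policy is one simple way to skip low-value regions of the action space.
    Weights are clamped at zero so they stay non-negative.
    """
    return [max(0.0, a + rng.uniform(-radius, radius)) for a in policy_action]


def rank(candidates, weights):
    """Sort candidate items by fused score, highest first."""
    return sorted(
        candidates,
        key=lambda c: fuse_scores(c["scores"], weights),
        reverse=True,
    )


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical items with two MTL outputs each, e.g. (click, finish).
    items = [
        {"id": "a", "scores": [0.9, 0.1]},
        {"id": "b", "scores": [0.3, 0.8]},
    ]
    action = explore_near([1.0, 0.5], radius=0.1, rng=rng)
    print([c["id"] for c in rank(items, action)])
```

Under these assumptions, changing the weights changes which item ranks first, which is why the choice of action for each state matters for the final recommendation list.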

