Playlist personalization is a common feature of music streaming services, but conventional techniques such as collaborative filtering rely on explicit assumptions about content quality to learn how to make recommendations. Such assumptions often cause a misalignment between offline model objectives and online user-satisfaction metrics. In this paper, we present a reinforcement learning (RL) framework that addresses these limitations by directly optimizing user-satisfaction metrics in a simulated playlist-generation environment. Using this simulator, we develop and train a modified Deep Q-Network, the Action Head DQN (AH-DQN), in a manner that addresses the challenges posed by the large state and action spaces of our RL formulation. The resulting policy can make recommendations from large and dynamic sets of candidate items with the goal of maximizing consumption metrics. We analyze and evaluate agents offline through simulations that use environment models trained on both public and proprietary streaming datasets. In online A/B tests, these agents achieve better user-satisfaction metrics than baseline methods. Finally, we demonstrate that the performance assessments produced by our simulator are strongly correlated with the observed online metric results.