We present a novel podcast recommender system deployed at industrial scale. The system optimizes personal listening journeys that unfold over months for hundreds of millions of listeners. Departing from the pervasive industry practice of optimizing machine learning algorithms for short-term proxy metrics, the system substantially improves long-term performance in A/B tests. The paper offers insights into how our methods cope with the attribution, coordination, and measurement challenges that usually hinder such long-term optimization. To contextualize these practical insights within a broader academic framework, we turn to reinforcement learning (RL). Using the language of RL, we formulate a comprehensive model of users' recurring relationships with a recommender system. Within this model, we identify our approach as a policy improvement update to a component of the existing recommender system, enhanced by tailored modeling of value functions and user-state representations. Illustrative offline experiments suggest this specialized modeling reduces data requirements by as much as a factor of 120,000 compared with black-box approaches.