Bandit learning algorithms have become an increasingly popular design choice for recommender systems. Despite the community's strong interest in bandit learning, multiple bottlenecks still prevent many bandit learning approaches from reaching production. Two of the most important are scaling to multi-task settings and A/B testing. Classic bandit algorithms, especially those leveraging contextual information, often require reward for uncertainty estimation, which hinders their adoption in multi-task recommender systems. Moreover, unlike supervised learning algorithms, bandit learning algorithms place great emphasis on the data collection process through their explorative nature. Such explorative behavior leads to unfair evaluation of bandit learning agents in a classic A/B test setting. In this work, we present a novel design of a production bandit learning life-cycle for recommender systems, along with a novel set of metrics to measure its efficiency in user exploration. We show through large-scale production recommender system experiments and in-depth analysis that our bandit agent design improves personalization for the production recommender system and that our experiment design fairly evaluates the performance of bandit learning algorithms.