Bandit learning has become an increasingly popular design choice for recommender systems. Despite strong community interest in bandit learning, multiple bottlenecks still prevent many bandit learning approaches from reaching production. One major bottleneck is how to test the effectiveness of a bandit algorithm fairly and without data leakage. Unlike supervised learning algorithms, bandit learning algorithms place great emphasis on the data collection process through their explorative nature. Such explorative behavior may induce unfair evaluation in a classic A/B test setting. In this work, we apply upper confidence bound (UCB) to our large-scale short video recommender system and present a test framework for the production bandit learning life-cycle with a new set of metrics. Extensive experimental results show that our experiment design is able to fairly evaluate the performance of bandit learning in the recommender system.
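For readers unfamiliar with UCB, the following is a minimal sketch of the classic UCB1 rule on a simulated click problem. The abstract does not say which UCB variant the production system uses, so this should be read as an illustration of the general idea rather than the authors' method. The function names (`ucb1_select`, `run_ucb1`), the exploration constant `c`, and the simulated click probabilities are all assumptions made for this example.

```python
import math
import random


def ucb1_select(counts, sums, t, c=2.0):
    """Pick the arm with the highest UCB1 score.

    counts[a] is the number of times arm a was shown, sums[a] is its total
    reward, and t is the current round (1-based). The constant c is an
    assumed exploration weight; c=2.0 gives the textbook UCB1 bonus.
    """
    # Show every arm once before trusting any estimate.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        sums[a] / counts[a] + math.sqrt(c * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_ucb1(click_probs, rounds, seed=0):
    """Simulate UCB1 against Bernoulli arms with hypothetical click rates."""
    rng = random.Random(seed)
    counts = [0] * len(click_probs)
    sums = [0.0] * len(click_probs)
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, sums, t)
        reward = 1.0 if rng.random() < click_probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


if __name__ == "__main__":
    # Three hypothetical items with different true click rates.
    print(run_ucb1([0.1, 0.2, 0.8], rounds=2000))
```

Because the algorithm chooses which items to show, the data it collects depends on its own exploration. This is the property that makes a naive A/B comparison against a non-exploring baseline potentially unfair.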