Bandit Learning to Rank with Position-Based Click Models: Personalized and Equal Treatments

Online learning to rank (ONL2R) is a foundational problem for recommender systems and has received increasing attention in recent years. Among existing approaches to ONL2R, a natural modeling architecture is the multi-armed bandit (MAB) framework coupled with the position-based click model. However, developing efficient online learning policies for MAB-based ONL2R with position-based click models is highly challenging because of the combinatorial nature of the problem and the partial observability inherent in the position-based click model. To date, results on MAB-based ONL2R with position-based click models remain rather limited, which motivates us to fill this gap. Our main contributions are threefold: i) We propose the first general MAB framework that captures all key ingredients of ONL2R with position-based click models. Our model considers personalized and equal treatments in ONL2R ranking recommendations, both of which are widely used in practice; ii) Based on this analytical framework, we develop two unified greedy- and UCB-based policies, called GreedyRank and UCBRank, each of which can be applied to both personalized and equal ranking treatments; and iii) We show that GreedyRank and UCBRank enjoy $O(\sqrt{t}\ln t)$ and $O(\sqrt{t\ln t})$ anytime sublinear regret, respectively, for personalized and equal treatments. For the fundamentally hard equal ranking treatment, we identify classes of collective utility functions and their associated sufficient conditions under which $O(\sqrt{t}\ln t)$ and $O(\sqrt{t\ln t})$ anytime sublinear regrets are still achievable for GreedyRank and UCBRank, respectively. Our numerical experiments verify our theoretical results and demonstrate the efficiency of GreedyRank and UCBRank in seeking the optimal action under various problem settings.
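To make the setting concrete, the following is a minimal sketch of a position-based click model paired with a generic UCB-style ranking rule. It is not the GreedyRank or UCBRank policy from this work, whose details the abstract does not give. It assumes a single user type (no personalized or equal treatment), examination probabilities that are known to the learner, and an illustrative confidence bonus. All function names and parameters (`simulate_clicks`, `ucb_ranking`, `run`, `attraction`, `examination`) are hypothetical.

```python
import math
import random


def simulate_clicks(ranking, attraction, examination, rng):
    """Position-based click model: the item shown at position k is clicked
    with probability attraction[item] * examination[k]."""
    return [1 if rng.random() < attraction[item] * examination[k] else 0
            for k, item in enumerate(ranking)]


def ucb_ranking(clicks_sum, exam_sum, t, num_positions):
    """Rank items by an optimistic attraction estimate.

    Assumption: the bonus sqrt(1.5 * ln t / exam_sum) is an illustrative
    choice, not the index used by the paper's UCBRank.
    """
    scores = []
    for item in range(len(clicks_sum)):
        if exam_sum[item] == 0:
            scores.append(float("inf"))  # unseen items are explored first
        else:
            mean = clicks_sum[item] / exam_sum[item]
            bonus = math.sqrt(1.5 * math.log(t) / exam_sum[item])
            scores.append(mean + bonus)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:num_positions]


def run(attraction, examination, horizon, seed=0):
    """Play the UCB-style rule for `horizon` rounds on simulated clicks."""
    rng = random.Random(seed)
    n = len(attraction)
    clicks_sum = [0.0] * n  # total clicks observed per item
    exam_sum = [0.0] * n    # total examination mass accumulated per item
    ranking = []
    for t in range(1, horizon + 1):
        ranking = ucb_ranking(clicks_sum, exam_sum, t, len(examination))
        clicks = simulate_clicks(ranking, attraction, examination, rng)
        for k, item in enumerate(ranking):
            clicks_sum[item] += clicks[k]
            # Weight each display by how likely the position was examined,
            # since a missing click may only mean the slot was not examined.
            exam_sum[item] += examination[k]
    return ranking, clicks_sum, exam_sum
```

For example, `run([0.9, 0.5, 0.1, 0.05], [1.0, 0.5], horizon=2000)` returns the last ranking played together with the per-item statistics. Dividing clicks by accumulated examination mass, rather than by the raw display count, is one simple way to account for the partial observability mentioned above: an unclicked item in a low position says little about its attractiveness.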
