We consider the problem of latent bandits with cluster structure, where there are multiple users, each with an associated multi-armed bandit problem. These users are grouped into \emph{latent} clusters such that the mean reward vectors of users within the same cluster are identical. At each round, a user, selected uniformly at random, pulls an arm and observes a corresponding noisy reward. The goal of the users is to maximize their cumulative rewards. This problem is central to practical recommendation systems and has recently received wide attention \cite{gentile2014online, maillard2014latent}. If each user acts independently, every user must explore each arm on its own, and a regret of $\Omega(\sqrt{\mathsf{MNT}})$ is unavoidable, where $\mathsf{M}$ and $\mathsf{N}$ are the numbers of arms and users, respectively, and $\mathsf{T}$ is the number of rounds. Instead, we propose LATTICE (Latent bAndiTs via maTrIx ComplEtion), which exploits the latent cluster structure to achieve the minimax optimal regret of $\widetilde{O}(\sqrt{(\mathsf{M}+\mathsf{N})\mathsf{T}})$ when the number of clusters is $\widetilde{O}(1)$. This is the first algorithm to guarantee such a strong regret bound. LATTICE is based on a careful exploitation of arm information within a cluster while simultaneously clustering users. Furthermore, it is computationally efficient and requires only $O(\log{\mathsf{T}})$ calls to an offline matrix completion oracle across all $\mathsf{T}$ rounds.
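As a rough illustration of the gap between the two regret bounds, the following sketch ignores constants and logarithmic factors and informally compares the lower bound for independent users with the upper bound that uses the cluster structure. The symbols $\mathsf{C}$ (number of clusters) and $\mathbf{P}$ (matrix of mean rewards) are notation assumed here for illustration only. Stacking the users' mean reward vectors as rows gives $\mathbf{P} \in \mathbb{R}^{\mathsf{N} \times \mathsf{M}}$ with at most $\mathsf{C}$ distinct rows, so $\mathrm{rank}(\mathbf{P}) \le \mathsf{C}$; this is the kind of low-rank structure that a matrix completion oracle can exploit. The ratio of the two rates is
\[
\frac{\sqrt{\mathsf{M}\mathsf{N}\mathsf{T}}}{\sqrt{(\mathsf{M}+\mathsf{N})\mathsf{T}}}
= \sqrt{\frac{\mathsf{M}\mathsf{N}}{\mathsf{M}+\mathsf{N}}},
\qquad
\text{e.g. } \mathsf{M} = \mathsf{N} = 1000 \;\Rightarrow\; \sqrt{\frac{10^{6}}{2000}} = \sqrt{500} \approx 22.4 .
\]
When $\mathsf{M} = \mathsf{N}$ the ratio equals $\sqrt{\mathsf{N}/2}$, so under these simplifying assumptions the benefit of sharing information across users grows with the size of the system.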