We consider the problem of \emph{blocked} collaborative bandits where there are multiple users, each with an associated multi-armed bandit problem. These users are grouped into \emph{latent} clusters such that the mean reward vectors of users within the same cluster are identical. Our goal is to design algorithms that maximize the cumulative reward accrued by all the users over time, under the \emph{constraint} that no arm of a user is pulled more than $\mathsf{B}$ times. This problem was originally considered by \cite{Bresler:2014}, and designing regret-optimal algorithms for it has remained an open problem ever since. In this work, we propose an algorithm called \texttt{B-LATTICE} (Blocked Latent bAndiTs via maTrIx ComplEtion) that collaborates across users, while simultaneously satisfying the budget constraints, to maximize their cumulative rewards. Theoretically, under certain reasonable assumptions on the latent structure, with $\mathsf{M}$ users, $\mathsf{N}$ arms, $\mathsf{T}$ rounds per user, and $\mathsf{C}=O(1)$ latent clusters, \texttt{B-LATTICE} achieves a per-user regret of $\widetilde{O}(\sqrt{\mathsf{T}(1 + \mathsf{N}\mathsf{M}^{-1})})$ under a budget constraint of $\mathsf{B}=\Theta(\log \mathsf{T})$. These are the first sub-linear regret bounds for this problem, and they match the minimax regret bounds when $\mathsf{B}=\mathsf{T}$. Empirically, we demonstrate that our algorithm outperforms baselines even when $\mathsf{B}=1$. \texttt{B-LATTICE} runs in phases; in each phase, it clusters users into groups and collaborates across users within a group to quickly learn their reward models.
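To make the blocked setting concrete, the following is a minimal Python sketch of a toy environment in which users belong to hidden clusters that share a mean reward vector, and each (user, arm) pair can be pulled at most $\mathsf{B}$ times. It is an illustration of the problem setup only, not of \texttt{B-LATTICE} itself. The class name `BlockedClusterBandit`, the helper `run_random_policy`, the uniform means in $[0,1]$, and the Gaussian noise model are all assumptions made for this example and do not come from the paper.

```python
import random


class BlockedClusterBandit:
    """Toy environment (hypothetical): users in latent clusters, per-(user, arm) budget."""

    def __init__(self, num_users, num_arms, num_clusters, budget,
                 noise_std=0.1, seed=0):
        self.rng = random.Random(seed)
        self.num_users = num_users
        self.num_arms = num_arms
        self.budget = budget
        self.noise_std = noise_std
        # Latent structure, hidden from the learner: one mean reward vector
        # per cluster (assumed uniform in [0, 1] for illustration).
        self.cluster_means = [
            [self.rng.random() for _ in range(num_arms)]
            for _ in range(num_clusters)
        ]
        # Each user is assigned to exactly one latent cluster.
        self.user_cluster = [
            self.rng.randrange(num_clusters) for _ in range(num_users)
        ]
        # pulls[u][a] counts how many times user u has pulled arm a.
        self.pulls = [[0] * num_arms for _ in range(num_users)]

    def available_arms(self, user):
        """Arms this user may still pull without exceeding the budget."""
        return [a for a in range(self.num_arms)
                if self.pulls[user][a] < self.budget]

    def pull(self, user, arm):
        """Return a noisy reward, enforcing the blocking constraint."""
        if self.pulls[user][arm] >= self.budget:
            raise ValueError("budget exhausted for this (user, arm) pair")
        self.pulls[user][arm] += 1
        mean = self.cluster_means[self.user_cluster[user]][arm]
        return self.rng.gauss(mean, self.noise_std)


def run_random_policy(env, num_rounds):
    """Naive baseline: every user pulls a uniformly random available arm.

    Requires num_rounds <= num_arms * budget so an arm is always available.
    """
    total_reward = 0.0
    for _ in range(num_rounds):
        for user in range(env.num_users):
            arm = env.rng.choice(env.available_arms(user))
            total_reward += env.pull(user, arm)
    return total_reward


if __name__ == "__main__":
    env = BlockedClusterBandit(num_users=4, num_arms=6, num_clusters=2, budget=1)
    print(run_random_policy(env, num_rounds=6))
```

With $\mathsf{B}=1$, as in the usage above, each user can try every arm exactly once, so a single user cannot estimate an arm's mean by repeated pulls and must rely on rewards observed by other users in the same cluster. The random policy ignores this shared structure and serves only as a reference point.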