We consider a contextual bandit problem with $S$ contexts and $A$ actions. In each round $t=1,2,\dots$, the learner observes a random context and chooses an action based on its past experience. The learner then observes a random reward whose mean is a function of the context and the action chosen in that round. We assume that the contexts can be lumped into $r\le \min\{S,A\}$ groups such that any two contexts in the same group have the same mean reward for every action. Under this assumption, we give an algorithm that, with high probability, outputs an $\epsilon$-optimal policy after using at most $\widetilde O(r(S+A)/\epsilon^2)$ samples, and we provide a matching $\widetilde\Omega(r(S+A)/\epsilon^2)$ lower bound. In the regret minimization setting, we give an algorithm whose cumulative regret up to time $T$ is bounded by $\widetilde O(\sqrt{r^3(S+A)T})$. To the best of our knowledge, we are the first to show near-optimal sample complexity in the PAC setting and $\widetilde O(\sqrt{\mathrm{poly}(r)(S+A)T})$ minimax regret in the online setting for this problem. We also show that our algorithms can be applied to more general low-rank bandits and yield improved regret bounds in some scenarios.
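To make the lumping assumption concrete, the following is a minimal sketch of the problem instance only, not of the algorithms described above. It builds an $S\times A$ mean-reward matrix whose rows take at most $r$ distinct values, so its rank is at most $r$, and it measures how far a policy is from optimal. The helper names `make_lumped_bandit` and `policy_gap`, the uniform sampling of group rewards, and the uniform context distribution are all assumptions made for illustration.

```python
import numpy as np

def make_lumped_bandit(S, A, r, seed=0):
    """Build an S x A mean-reward matrix whose rows take at most r distinct values.

    Hypothetical helper: group rewards are drawn uniformly from [0, 1]
    purely for illustration.
    """
    rng = np.random.default_rng(seed)
    group_means = rng.uniform(0.0, 1.0, size=(r, A))  # one reward row per group
    group_of = rng.integers(0, r, size=S)             # hidden group label of each context
    group_of[:r] = np.arange(r)                       # use every group at least once (needs r <= S)
    return group_means[group_of], group_of

def policy_gap(mean_rewards, policy, context_probs):
    """Expected suboptimality of a deterministic policy (one action per context)."""
    best = mean_rewards.max(axis=1)
    chosen = mean_rewards[np.arange(len(policy)), policy]
    return float(context_probs @ (best - chosen))

S, A, r = 50, 20, 3
M, group_of = make_lumped_bandit(S, A, r)
context_probs = np.full(S, 1.0 / S)   # assumption: contexts arrive uniformly at random
optimal_policy = M.argmax(axis=1)     # best action for each context
```

In this sketch, a policy is $\epsilon$-optimal when `policy_gap(M, policy, context_probs)` is at most $\epsilon$. Because contexts in the same group share a row of `M`, a learner only needs to identify the $r$ distinct rows and the group of each context, which is the intuition behind a sample complexity scaling with $r(S+A)$ rather than $SA$.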