We study contextual bandits with low-rank structure: in each round, when the (context, arm) pair $(i,j)\in [m]\times [n]$ is selected, the learner observes a noisy sample of the $(i,j)$-th entry of an unknown reward matrix of low rank $r$. Contexts are drawn i.i.d. across rounds and revealed to the learner. For such bandits, we present efficient algorithms for policy evaluation, best policy identification, and regret minimization. For policy evaluation and best policy identification, we show that our algorithms are nearly minimax optimal. For instance, the number of samples required to return an $\varepsilon$-optimal policy with probability at least $1-\delta$ typically scales as $\frac{m+n}{\varepsilon^2}\log(1/\delta)$. Our regret minimization algorithm enjoys a minimax regret guarantee scaling as $r^{7/4}(m+n)^{3/4}\sqrt{T}$ over $T$ rounds, which improves over existing algorithms. All the proposed algorithms consist of two phases. In the first phase, they use spectral methods to estimate the left and right singular subspaces of the low-rank reward matrix, and we show that these estimates enjoy tight error guarantees in the two-to-infinity norm. This in turn allows us to reformulate each problem as a misspecified linear bandit problem of dimension roughly $r(m+n)$, with misspecification controlled by the subspace recovery error, and to design the second phase of our algorithms efficiently.
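To make the two-phase idea concrete, the following is a minimal numerical sketch, not the paper's algorithm. It assumes a simplified sampling model in which every entry of the reward matrix is observed once with Gaussian noise, estimates the singular subspaces with a truncated SVD, and measures the subspace error as the two-to-infinity norm of the difference of projectors. It then splits the true matrix into a part that is linear in the estimated subspaces and a residual that plays the role of the misspecification. All function names, the sampling model, the error metric, and the feature count $r(m+n)-r^2$ are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np


def two_to_infinity_norm(A):
    """Two-to-infinity norm of A: the largest Euclidean norm of any row."""
    return float(np.max(np.linalg.norm(A, axis=1)))


def estimate_subspaces(M_obs, r):
    """Phase 1 (sketch): top-r left/right singular vectors of the noisy matrix."""
    U, _, Vt = np.linalg.svd(M_obs, full_matrices=False)
    return U[:, :r], Vt[:r].T


def projector(B):
    """Orthogonal projector onto the column span of B (orthonormal columns)."""
    return B @ B.T


def misspecification(M, U_hat, V_hat):
    """Part of M lying outside BOTH estimated subspaces.

    M - misspecification(M, U_hat, V_hat) is linear in features built from
    U_hat and V_hat; this residual is what a linear model cannot capture.
    """
    m, n = M.shape
    return (np.eye(m) - projector(U_hat)) @ M @ (np.eye(n) - projector(V_hat))


def demo(m=40, n=30, r=2, sigma=0.1, seed=0):
    """Hypothetical experiment: one noisy observation of every entry."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((m, r)))
    V, _ = np.linalg.qr(rng.standard_normal((n, r)))
    singular_values = np.linspace(20.0, 10.0, r)
    M = U @ np.diag(singular_values) @ V.T
    M_obs = M + sigma * rng.standard_normal((m, n))

    U_hat, V_hat = estimate_subspaces(M_obs, r)
    E = misspecification(M, U_hat, V_hat)
    return {
        "err_U": two_to_infinity_norm(projector(U_hat) - projector(U)),
        "err_V": two_to_infinity_norm(projector(V_hat) - projector(V)),
        "max_misspecification": float(np.max(np.abs(E))),
        # Feature count of the linear part under this particular split.
        "linear_dimension": r * (m + n) - r * r,
    }


if __name__ == "__main__":
    for key, value in demo().items():
        print(f"{key}: {value}")
```

In this sketch the residual is a product of two projection errors, so it shrinks faster than either subspace error on its own. This is one way to see why a sharp entrywise (two-to-infinity) guarantee in the first phase translates into a small misspecification for the second phase.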