We study contextual bandits with low-rank structure where, in each round, if the (context, arm) pair $(i,j)\in [m]\times [n]$ is selected, the learner observes a noisy sample of the $(i,j)$-th entry of an unknown low-rank reward matrix. Successive contexts are generated randomly in an i.i.d. manner and are revealed to the learner. For such bandits, we present efficient algorithms for policy evaluation, best policy identification, and regret minimization. For policy evaluation and best policy identification, we show that our algorithms are nearly minimax optimal. For instance, the number of samples required to return an $\varepsilon$-optimal policy with probability at least $1-\delta$ typically scales as $\frac{r(m+n)}{\varepsilon^2}\log(1/\delta)$. Our regret minimization algorithm enjoys minimax guarantees typically scaling as $r^{7/4}(m+n)^{3/4}\sqrt{T}$, which improves over existing algorithms. All the proposed algorithms proceed in two phases: in the first, they leverage spectral methods to estimate the left and right singular subspaces of the low-rank reward matrix. We show that these estimates enjoy tight error guarantees in the two-to-infinity norm. This in turn allows us to reformulate our problems as a misspecified linear bandit problem with dimension roughly $r(m+n)$ and misspecification controlled by the subspace recovery error, and to design the second phase of our algorithms efficiently.
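The spectral first phase described above can be illustrated with a minimal simulation. This is a hypothetical sketch, not the paper's algorithm: it assumes uniform entry sampling, Gaussian noise, and measures subspace recovery in the spectral norm rather than the finer two-to-infinity norm used in the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 60, 50, 3  # illustrative sizes; assumptions, not the paper's setup

# Ground-truth rank-r reward matrix (known only to the simulator).
M = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))

# Phase 1: collect noisy samples of uniformly chosen (context, arm) pairs
# and average repeated observations into an empirical matrix M_hat.
T = 20000
sums = np.zeros((m, n))
counts = np.zeros((m, n))
for _ in range(T):
    i, j = rng.integers(m), rng.integers(n)
    sums[i, j] += M[i, j] + rng.normal(scale=0.5)  # noisy entry observation
    counts[i, j] += 1
M_hat = sums / np.maximum(counts, 1)

# Spectral step: top-r singular subspaces of the empirical matrix.
U, s, Vt = np.linalg.svd(M_hat, full_matrices=False)
U_r, V_r = U[:, :r], Vt[:r, :].T  # estimated left/right singular subspaces

# Subspace recovery error via projection distance (spectral norm here).
U_true = np.linalg.svd(M, full_matrices=False)[0][:, :r]
err = np.linalg.norm(U_r @ U_r.T - U_true @ U_true.T, 2)
```

With the estimated subspaces in hand, each (context, arm) pair maps to a feature of dimension roughly $r(m+n)$, and the residual `err` plays the role of the misspecification level in the second-phase linear bandit.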