We study linear contextual bandits in the misspecified setting, where the expected reward function can be approximated by a linear function class up to a bounded misspecification level $\zeta>0$. We propose an algorithm based on a novel data selection scheme that uses only the contextual vectors with large uncertainty for online regression. We show that when the misspecification level $\zeta$ is bounded by $\tilde O(\Delta / \sqrt{d})$, where $\Delta$ is the minimal sub-optimality gap and $d$ is the dimension of the contextual vectors, our algorithm enjoys the same gap-dependent regret bound $\tilde O(d^2/\Delta)$ as in the well-specified setting, up to logarithmic factors. In addition, we show that the existing algorithm SupLinUCB (Chu et al., 2011) can also achieve a gap-dependent constant regret bound without knowledge of the sub-optimality gap $\Delta$. Together with a lower bound adapted from Lattimore et al. (2020), our results suggest an interplay between the misspecification level and the sub-optimality gap: (1) the linear contextual bandit model is efficiently learnable when $\zeta \leq \tilde O(\Delta / \sqrt{d})$; and (2) it is not efficiently learnable when $\zeta \geq \tilde \Omega(\Delta / \sqrt{d})$. Experiments on both synthetic and real-world datasets corroborate our theoretical results.
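The data selection idea described above can be illustrated with a minimal sketch. It is not the algorithm from the paper: it assumes that "uncertainty" means the elliptical norm $\|x\|_{A^{-1}}$ under a ridge-regression Gram matrix $A$, and the threshold `gamma`, the regularizer `lam`, and the function names are hypothetical choices made only for illustration. A context enters the online regression only when its uncertainty is at least `gamma`.

```python
import numpy as np


def select_and_update(A, b, x, reward, gamma):
    """Add (x, reward) to the regression only if x has large uncertainty.

    A is the regularized Gram matrix and b the reward-weighted sum of the
    selected contexts. gamma is a hypothetical uncertainty threshold.
    """
    # Uncertainty of x under the current Gram matrix: sqrt(x^T A^{-1} x).
    width = float(np.sqrt(x @ np.linalg.solve(A, x)))
    if width >= gamma:
        return A + np.outer(x, x), b + reward * x, True
    # Low-uncertainty contexts are skipped and leave the estimator unchanged.
    return A, b, False


def run_selection(contexts, rewards, gamma=0.6, lam=1.0):
    """Run the selection rule over a stream and return the ridge estimate."""
    d = contexts.shape[1]
    A = lam * np.eye(d)  # ridge regularization (assumed)
    b = np.zeros(d)
    n_selected = 0
    for x, r in zip(contexts, rewards):
        A, b, used = select_and_update(A, b, x, r, gamma)
        n_selected += int(used)
    theta_hat = np.linalg.solve(A, b)
    return theta_hat, n_selected


if __name__ == "__main__":
    # The same direction repeated: uncertainty shrinks as it is selected.
    contexts = np.tile(np.array([1.0, 0.0]), (5, 1))
    rewards = np.ones(5)
    theta_hat, n_selected = run_selection(contexts, rewards)
    print(theta_hat, n_selected)
```

In this toy stream the same context repeats, so its uncertainty shrinks each time it is selected, and once it drops below `gamma` the remaining copies are ignored. A full bandit loop would also need an arm-selection rule, which this sketch leaves out.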