We study a kernelized version of the $\epsilon$-greedy strategy for contextual bandits. More precisely, in a setting with finitely many arms, we assume that the mean reward functions lie in a reproducing kernel Hilbert space (RKHS). We propose an online weighted kernel ridge regression estimator for the reward functions. Under suitable conditions on the exploration probability sequence $\{\epsilon_t\}_t$ and the regularization parameter sequence $\{\lambda_t\}_t$, we show that the proposed estimator is consistent. We also show that, for any choice of kernel and corresponding RKHS, the strategy achieves a sub-linear regret rate that depends on the intrinsic dimensionality of the RKHS. Furthermore, under a margin condition, it achieves the optimal regret rate of $\sqrt{T}$ for finite-dimensional RKHSs.
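To make the setup concrete, the following is a minimal sketch of a kernelized $\epsilon$-greedy loop with a weighted kernel ridge regression estimate per arm. It is an illustration under stated assumptions, not the estimator analyzed in the paper: the abstract does not specify the weights, so inverse selection probabilities are assumed here, and the Gaussian kernel, the schedules $\epsilon_t = \min(1, t^{-1/3})$ and $\lambda_t = t^{-1/2}$, and the names `rbf_kernel`, `weighted_krr_predict`, `run_kernel_eps_greedy` and `demo_reward` are all hypothetical choices.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian kernel matrix between rows of A (n, d) and rows of B (m, d).
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def weighted_krr_predict(X, y, w, x_new, lam, gamma=1.0):
    # Minimise sum_i w_i (y_i - f(x_i))^2 + lam * ||f||_H^2 over the RKHS.
    # Writing f = sum_j alpha_j k(x_j, .), one solution is
    # alpha = (W K + lam I)^{-1} W y with W = diag(w).
    X, y, w = (np.asarray(v, dtype=float) for v in (X, y, w))
    if len(y) == 0:
        return 0.0  # no data for this arm yet: default estimate (assumption)
    K = rbf_kernel(X, X, gamma)
    W = np.diag(w)
    alpha = np.linalg.solve(W @ K + lam * np.eye(len(y)), W @ y)
    return float((rbf_kernel(x_new[None, :], X, gamma) @ alpha)[0])

def run_kernel_eps_greedy(contexts, reward_fn, n_arms, eps_fn, lam_fn, seed=0):
    rng = np.random.default_rng(seed)
    X = [[] for _ in range(n_arms)]  # contexts observed per arm
    y = [[] for _ in range(n_arms)]  # rewards observed per arm
    w = [[] for _ in range(n_arms)]  # regression weights per arm
    arms, rewards, weights = [], [], []
    for t, x in enumerate(contexts, start=1):
        eps, lam = eps_fn(t), lam_fn(t)
        est = [weighted_krr_predict(X[a], y[a], w[a], x, lam) for a in range(n_arms)]
        greedy = int(np.argmax(est))
        # Explore uniformly with probability eps, otherwise play the greedy arm.
        arm = int(rng.integers(n_arms)) if rng.random() < eps else greedy
        # Probability that the played arm was selected at this round.
        p = eps / n_arms + (1.0 - eps) * (arm == greedy)
        r = reward_fn(x, arm, rng)
        X[arm].append(x); y[arm].append(r); w[arm].append(1.0 / p)
        arms.append(arm); rewards.append(r); weights.append(1.0 / p)
    return arms, rewards, weights

def demo_reward(x, arm, rng):
    # Hypothetical smooth mean rewards for two arms, plus Gaussian noise.
    mean = np.sin(3.0 * x[0]) if arm == 0 else np.cos(3.0 * x[0])
    return float(mean + 0.1 * rng.normal())

if __name__ == "__main__":
    ctx = np.random.default_rng(1).uniform(-1.0, 1.0, size=(100, 1))
    arms, rewards, _ = run_kernel_eps_greedy(
        ctx, demo_reward, n_arms=2,
        eps_fn=lambda t: min(1.0, t ** (-1.0 / 3.0)),
        lam_fn=lambda t: t ** (-0.5),
    )
    print("mean reward:", np.mean(rewards))
```

The sketch refits each arm from scratch at every round, which costs a linear solve in the number of samples for that arm; this keeps the code short but is not an efficient online update. With all weights equal to one, `weighted_krr_predict` reduces to ordinary kernel ridge regression.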