We study a generalization of online learning in adversarial linear contextual bandits in which the loss functions belong to a reproducing kernel Hilbert space, which allows more flexible modeling of complex decision-making scenarios. We propose a computationally efficient algorithm that uses a new optimistically biased estimator of the loss functions and achieves near-optimal regret guarantees under a variety of eigenvalue decay assumptions on the underlying kernel. Specifically, under polynomial eigendecay with exponent $c>1$, the regret is $\widetilde{O}(KT^{\frac{1}{2}(1+\frac{1}{c})})$, where $T$ denotes the number of rounds and $K$ the number of actions. When the eigendecay is exponential, we achieve an even tighter regret bound of $\widetilde{O}(\sqrt{T})$. These rates match the lower bounds in every special case for which a lower bound is known, and match the best known upper bounds for the better-studied stochastic counterpart of our problem.
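To make the polynomial-eigendecay rate concrete, the exponent of $T$ can be evaluated at a few values of $c$. The values $c=2$ and $c=4$ below are illustrative choices only, not cases singled out in the paper; the arithmetic follows directly from the stated bound.

$$
\frac{1}{2}\Bigl(1+\frac{1}{c}\Bigr)\Big|_{c=2}=\frac{3}{4},
\qquad
\frac{1}{2}\Bigl(1+\frac{1}{c}\Bigr)\Big|_{c=4}=\frac{5}{8},
\qquad
\lim_{c\to\infty}\frac{1}{2}\Bigl(1+\frac{1}{c}\Bigr)=\frac{1}{2}.
$$

Since $\frac{1}{2}(1+\frac{1}{c})<1$ exactly when $c>1$, the bound is sublinear in $T$ over the whole assumed range of $c$, and as $c$ grows the exponent approaches the $\frac{1}{2}$ of the $\widetilde{O}(\sqrt{T})$ rate stated for exponential eigendecay.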