We study the generalized linear contextual bandit problem under the constraints of limited adaptivity. We present two algorithms, \texttt{B-GLinCB} and \texttt{RS-GLinCB}, that address, respectively, two prevalent limited adaptivity models: batch learning with stochastic contexts and rare policy switches with adversarial contexts. For both models, we establish essentially tight regret bounds. Notably, our bounds eliminate a dependence on a key parameter $\kappa$, which captures the non-linearity of the underlying reward model. For our batch learning algorithm \texttt{B-GLinCB}, with $\Omega\left( \log{\log T} \right)$ batches, the regret scales as $\tilde{O}(\sqrt{T})$. Further, we establish that our rarely switching algorithm \texttt{RS-GLinCB} updates its policy at most $\tilde{O}(\log^2 T)$ times and achieves a regret of $\tilde{O}(\sqrt{T})$. Our approach for removing the dependence on $\kappa$ in generalized linear contextual bandits might be of independent interest.