We study best-of-both-worlds algorithms for $K$-armed linear contextual bandits. Our algorithms deliver near-optimal regret bounds in both the adversarial and stochastic regimes, without prior knowledge of the environment. In the stochastic regime, we achieve the polylogarithmic rate $\frac{(dK)^2\mathrm{poly}\log(dKT)}{\Delta_{\min}}$, where $\Delta_{\min}$ is the minimum suboptimality gap over the $d$-dimensional context space. In the adversarial regime, we obtain either a first-order $\widetilde{O}(dK\sqrt{L^*})$ bound or a second-order $\widetilde{O}(dK\sqrt{\Lambda^*})$ bound, where $L^*$ is the cumulative loss of the best action and $\Lambda^*$ is a notion of the cumulative second moment of the losses incurred by the algorithm. Moreover, we develop an algorithm based on FTRL with a Shannon entropy regularizer that does not require knowledge of the inverse of the covariance matrix; it achieves polylogarithmic regret in the stochastic regime and an $\widetilde{O}\big(dK\sqrt{T}\big)$ regret bound in the adversarial regime.
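For readers unfamiliar with the FTRL variant with a Shannon entropy regularizer mentioned above, the following is a minimal sketch of the generic update over the probability simplex, not the paper's algorithm. With a fixed learning rate `eta` (an assumption; the paper's learning-rate schedule and context-dependent loss estimators are not reproduced here), the minimizer of $\langle p, \hat{L}\rangle + \frac{1}{\eta}\sum_a p_a \log p_a$ has the closed form $p_a \propto \exp(-\eta \hat{L}_a)$. The function name and the example loss values are hypothetical.

```python
import math


def ftrl_shannon_entropy(cum_loss_estimates, eta):
    """Generic FTRL step with a (negative) Shannon entropy regularizer.

    Returns the distribution p on the simplex minimizing
        <p, L> + (1 / eta) * sum_a p_a * log(p_a),
    which is the exponential-weights distribution p_a ∝ exp(-eta * L_a).
    `cum_loss_estimates` (L) and `eta` are illustrative inputs only.
    """
    # Subtract the minimum loss before exponentiating for numerical stability;
    # this shift cancels in the normalization and does not change the result.
    shift = min(cum_loss_estimates)
    weights = [math.exp(-eta * (loss - shift)) for loss in cum_loss_estimates]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical cumulative loss estimates for K = 3 arms.
probs = ftrl_shannon_entropy([3.0, 1.0, 2.0], eta=0.5)
```

Arms with smaller cumulative loss estimates receive larger probability, and the ratio between any two arms' probabilities is `exp(eta * (difference in losses))`.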