We study best-of-both-worlds algorithms for $K$-armed linear contextual bandits. Our algorithms achieve near-optimal regret bounds in both the adversarial and stochastic regimes, without prior knowledge of the environment. In the stochastic regime, we achieve the polylogarithmic rate $\frac{(dK)^2\mathrm{poly}\log(dKT)}{\Delta_{\min}}$, where $\Delta_{\min}$ is the minimum suboptimality gap over the $d$-dimensional context space. In the adversarial regime, we obtain either a first-order $\widetilde{O}(dK\sqrt{L^*})$ bound or a second-order $\widetilde{O}(dK\sqrt{\Lambda^*})$ bound, where $L^*$ is the cumulative loss of the best action and $\Lambda^*$ is a notion of the cumulative second moment of the losses incurred by the algorithm. Moreover, we develop an algorithm based on FTRL with the Shannon entropy regularizer that does not require knowledge of the inverse of the covariance matrix. It achieves polylogarithmic regret in the stochastic regime and an $\widetilde{O}\big(dK\sqrt{T}\big)$ regret bound in the adversarial regime.
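For orientation, the generic FTRL update with a negative Shannon entropy regularizer over the $K$ actions has a well-known closed form, sketched below. The symbols $\eta_t$ (a learning rate), $\widehat{\ell}_s(x,a)$ (an estimate of the loss of action $a$ under context $x$ at round $s$), and $p_t(\cdot \mid x)$ (the sampling distribution at round $t$) are notation assumed here for illustration; the abstract does not specify the loss estimator or the learning-rate schedule actually used.

$$
p_t(\cdot \mid x) \in \operatorname*{arg\,min}_{p \in \Delta_K} \left\{ \sum_{a=1}^{K} p(a) \sum_{s<t} \widehat{\ell}_s(x,a) + \frac{1}{\eta_t} \sum_{a=1}^{K} p(a) \log p(a) \right\}
\quad\Longrightarrow\quad
p_t(a \mid x) = \frac{\exp\!\big(-\eta_t \sum_{s<t} \widehat{\ell}_s(x,a)\big)}{\sum_{b=1}^{K} \exp\!\big(-\eta_t \sum_{s<t} \widehat{\ell}_s(x,b)\big)}
$$

Under these assumptions the update reduces to exponential weights over the estimated cumulative losses, so each round only requires a softmax over $K$ values for the observed context.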