We study high-probability regret bounds for adversarial $K$-armed bandits with time-varying feedback graphs over $T$ rounds. For general strongly observable graphs, we develop an algorithm that achieves the optimal regret $\widetilde{\mathcal{O}}((\sum_{t=1}^T\alpha_t)^{1/2}+\max_{t\in[T]}\alpha_t)$ with high probability, where $\alpha_t$ is the independence number of the feedback graph at round $t$. Compared to the best existing result [Neu, 2015], which only considers graphs in which every node has a self-loop, our result not only holds more generally but, importantly, also removes any $\text{poly}(K)$ dependence, which can be prohibitively large in applications such as contextual bandits. Furthermore, we develop the first algorithm that achieves the optimal high-probability regret bound for weakly observable graphs; through a refined analysis, it even improves on the best expected regret bound of [Alon et al., 2015] by removing the $\mathcal{O}(\sqrt{KT})$ term. Our algorithms build on the online mirror descent framework, with an innovative combination of several techniques. Notably, while earlier works use optimistic biased loss estimators to achieve high-probability bounds, we find it important to use a pessimistic one for nodes without a self-loop in a strongly observable graph.
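To make the graph-theoretic quantities in the abstract concrete, the following is a minimal Python sketch that checks observability of a single feedback graph and computes its independence number by brute force. It is not the paper's algorithm. It assumes the standard definitions from the feedback-graph literature: a node is strongly observable if it has a self-loop or is observed by every other node, a graph is weakly observable if every node has some in-neighbor but not all nodes are strongly observable, and independence ignores edge direction and self-loops. The adjacency-matrix encoding, the function names, and the example graphs are all hypothetical choices made for illustration.

```python
from itertools import combinations

def is_strongly_observable(adj):
    """adj[i][j] is True if playing arm i reveals the loss of arm j."""
    k = len(adj)
    for j in range(k):
        has_self_loop = adj[j][j]
        seen_by_all_others = all(adj[i][j] for i in range(k) if i != j)
        if not (has_self_loop or seen_by_all_others):
            return False
    return True

def is_observable(adj):
    """Every arm has at least one in-neighbor (possibly itself)."""
    k = len(adj)
    return all(any(adj[i][j] for i in range(k)) for j in range(k))

def is_weakly_observable(adj):
    """Observable, but at least one arm is not strongly observable."""
    return is_observable(adj) and not is_strongly_observable(adj)

def independence_number(adj):
    """Largest set of arms with no edge between any two distinct members.

    Brute force over subsets, so only suitable for small K.
    Edge direction and self-loops are ignored.
    """
    k = len(adj)
    for size in range(k, 0, -1):
        for subset in combinations(range(k), size):
            if all(not adj[i][j] and not adj[j][i]
                   for i, j in combinations(subset, 2)):
                return size
    return 0

T, F = True, False

# Classical bandit feedback: each arm reveals only its own loss.
bandit = [[T, F, F],
          [F, T, F],
          [F, F, T]]

# Loopless clique: each arm reveals every loss except its own.
loopless_clique = [[F, T, T],
                   [T, F, T],
                   [T, T, F]]

# Arm 0 reveals arms 1 and 2; arm 1 reveals arm 0; arm 2 reveals nothing.
weak = [[F, T, T],
        [T, F, F],
        [F, F, F]]

if __name__ == "__main__":
    for name, g in [("bandit", bandit),
                    ("loopless_clique", loopless_clique),
                    ("weak", weak)]:
        print(name, is_strongly_observable(g), is_weakly_observable(g),
              independence_number(g))
```

The loopless clique is strongly observable even though no node has a self-loop, which is the kind of graph the self-loop assumption of [Neu, 2015] excludes. In the bandit graph the independence number equals $K$, so a bound stated in terms of $\alpha_t$ reduces to the usual dependence on the number of arms.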