We consider the adversarial linear contextual bandit problem, where the loss vectors are selected fully adversarially and the per-round action set (i.e., the context) is drawn from a fixed distribution. Existing methods for this problem either require access to a simulator to generate free i.i.d. contexts, achieve a sub-optimal regret no better than $\widetilde{O}(T^{\frac{5}{6}})$, or are computationally inefficient. We substantially improve on these results by achieving a regret of $\widetilde{O}(\sqrt{T})$ without a simulator, while remaining computationally efficient when the action set in each round is small. In the special case of sleeping bandits with adversarial losses and stochastic arm availability, our result affirmatively answers the open question of Saha et al. [2020] on whether there exists a polynomial-time algorithm with $\mathrm{poly}(d)\sqrt{T}$ regret. Our approach naturally handles the case where the loss is linear up to an additive misspecification error, and our regret bound shows near-optimal dependence on the magnitude of the error.
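For concreteness, the setting can be written out as follows. This is a sketch in the standard notation for this problem, not notation taken from the abstract: the symbols $\mathcal{A}_t$, $\mathcal{D}$, $\theta_t$, $a_t$, $\pi$, $\ell_t$, and $\varepsilon$ are all assumptions introduced here for illustration. In each round $t = 1, \dots, T$, an action set $\mathcal{A}_t \subset \mathbb{R}^d$ is drawn i.i.d. from a fixed distribution $\mathcal{D}$, an adversary chooses a loss vector $\theta_t \in \mathbb{R}^d$, and the learner observes $\mathcal{A}_t$, picks $a_t \in \mathcal{A}_t$, and incurs loss $\langle a_t, \theta_t \rangle$. Regret would then typically be measured against the best fixed policy $\pi$ mapping each action set to one of its elements:

$$
\mathrm{Reg}_T \;=\; \max_{\pi} \; \mathbb{E}\left[ \sum_{t=1}^{T} \langle a_t, \theta_t \rangle \;-\; \sum_{t=1}^{T} \langle \pi(\mathcal{A}_t), \theta_t \rangle \right],
\qquad \pi(\mathcal{A}) \in \mathcal{A}.
$$

Under this reading, the misspecified case presumably replaces the linear loss with a general loss $\ell_t$ satisfying $\lvert \ell_t(a) - \langle a, \theta_t \rangle \rvert \le \varepsilon$ for all $a \in \mathcal{A}_t$, with $\varepsilon$ being the error magnitude on which the regret depends.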