We study a $K$-armed non-stationary bandit model in which rewards change smoothly, as captured by H\"{o}lder class assumptions on the rewards as functions of time. Such smooth changes are parametrized by a H\"{o}lder exponent $\beta$ and coefficient $\lambda$. Various sub-cases of this general model have been studied in isolation; we first establish the minimax dynamic regret rate in full generality, for all $K,\beta,\lambda$. Next, we show that this optimal dynamic regret can be attained adaptively, without knowledge of $\beta,\lambda$. In contrast, even with knowledge of these parameters, upper bounds were previously known only for the limited regimes $\beta\leq 1$ and $\beta=2$ (Slivkins, 2014; Krishnamurthy and Gopalan, 2021; Manegueu et al., 2021; Jia et al., 2023). Thus, our work resolves open questions raised by these disparate threads of the literature. We also study the problem of attaining faster gap-dependent regret rates in non-stationary bandits. While such rates have long been known to be impossible in general (Garivier and Moulines, 2011), we show that environments admitting a safe arm (Suk and Kpotufe, 2022) allow for much faster rates than the worst-case scaling with $\sqrt{T}$. Whereas previous works in this direction focused on attaining the usual logarithmic regret bounds, summed over stationary periods, our gap-dependent rates reveal new optimistic regimes of non-stationarity in which even the logarithmic bounds are pessimistic. We show that our new gap-dependent rate is tight and that its achievability (i.e., as made possible by a safe arm) has a surprisingly simple and clean characterization within the smooth H\"{o}lder class model.
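For readers unfamiliar with the terminology, the following is a sketch of one standard way to formalize the two notions used above. It is the textbook form and is not taken from the source; the paper's own normalization may differ. The symbols $f_a$, $\mu_t(a)$, $a_t$, $m$ and $R_T$ are introduced here for illustration only. Suppose the mean reward of arm $a$ at round $t$ is $\mu_t(a) = f_a(t/T)$ for some function $f_a$ on $[0,1]$, and let $m$ be the largest integer strictly less than $\beta$. A $(\beta,\lambda)$-H\"{o}lder assumption then asks that $f_a$ be $m$ times differentiable with

$$|f_a^{(m)}(x) - f_a^{(m)}(y)| \leq \lambda\,|x-y|^{\beta-m} \quad \text{for all } x,y\in[0,1],$$

and the dynamic regret of a policy playing arm $a_t$ at round $t$ compares against the best arm at every round:

$$R_T = \sum_{t=1}^{T}\Big(\max_{a\in\{1,\dots,K\}}\mu_t(a) - \mu_t(a_t)\Big).$$

Under this reading, $\beta\leq 1$ gives $m=0$ and a H\"{o}lder (Lipschitz when $\beta=1$) condition on the rewards themselves, while $\beta=2$ gives $m=1$ and a Lipschitz condition on their first derivative.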