We study non-stationary Lipschitz bandits, where the action set is infinite and the reward function, which satisfies a Lipschitz assumption, can change arbitrarily over time. We design an algorithm that adaptively tracks the recently introduced notion of significant shifts, which are defined by large deviations of the cumulative reward function. To detect such reward changes, our algorithm leverages a hierarchical discretization of the action space. Without requiring any prior knowledge of the non-stationarity, our algorithm achieves a minimax-optimal dynamic regret bound of $\widetilde{\mathcal{O}}(\tilde{L}^{1/3}T^{2/3})$, where $\tilde{L}$ is the number of significant shifts and $T$ is the horizon. This result provides the first optimal guarantee in this setting.
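To make the idea of a hierarchical discretization concrete, the sketch below builds one common instance of it: a dyadic partition of the interval $[0, 1]$, where each bin at depth $d$ splits into two bins at depth $d + 1$. The abstract does not specify the action space or the discretization scheme, so the choice of $[0, 1]$, the dyadic splitting, and all function names (`dyadic_bins`, `bin_index`, `children`, `path`, `within_bin_gap`) are assumptions made purely for illustration and are not the authors' construction.

```python
from fractions import Fraction


def dyadic_bins(depth):
    """Return the 2**depth equal-width bins that partition [0, 1] at this depth."""
    n = 2 ** depth
    return [(Fraction(i, n), Fraction(i + 1, n)) for i in range(n)]


def bin_index(x, depth):
    """Index of the depth-level bin containing the action x in [0, 1]."""
    n = 2 ** depth
    # Clamp so that x = 1 falls in the last bin rather than out of range.
    return min(int(x * n), n - 1)


def children(index, depth):
    """The two (index, depth) pairs at depth + 1 that refine the given bin."""
    return [(2 * index, depth + 1), (2 * index + 1, depth + 1)]


def path(x, max_depth):
    """Indices of the nested bins containing x, from depth 0 to max_depth."""
    return [bin_index(x, d) for d in range(max_depth + 1)]


def within_bin_gap(depth, lipschitz=1.0):
    """Upper bound on the reward difference between two actions in one bin.

    Assumes the reward is `lipschitz`-Lipschitz on [0, 1] (an illustrative
    assumption), so the gap is at most lipschitz times the bin width.
    """
    return lipschitz * 2.0 ** (-depth)
```

In this sketch, coarse bins are cheap to estimate but can hide a reward gap of up to `within_bin_gap(depth)`, while finer bins shrink that gap at the cost of more bins to sample. A hierarchy lets a procedure compare estimates at several resolutions, which is one plausible reason a multi-scale structure helps when checking for changes in the reward.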