In many applications of online decision making, the environment is non-stationary, so it is crucial to use bandit algorithms that handle changes. Most existing approaches are designed to protect against non-smooth changes, constrained only by total variation or Lipschitzness over time, for which they guarantee $T^{2/3}$ regret. In practice, however, environments often change {\it smoothly}, so such algorithms may incur higher-than-necessary regret in these settings and do not leverage information on the {\it rate of change}. In this paper, we study a non-stationary two-armed bandit problem in which each arm's mean reward is a $\beta$-H\"older function over (normalized) time, meaning it is $(\beta-1)$-times Lipschitz-continuously differentiable. We show the first {\it separation} between the smooth and non-smooth regimes by presenting a policy with $T^{3/5}$ regret for $\beta=2$. We complement this result with a $T^{\frac{\beta+1}{2\beta+1}}$ lower bound for any integer $\beta\ge 1$, which matches our upper bound for $\beta=2$.
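
As a quick check of the rates quoted above, the exponent of the lower bound can be evaluated at the two smoothness levels discussed. The symbol $e(\beta)$ is introduced here only for illustration and does not appear in the original text:
\[
e(\beta) = \frac{\beta+1}{2\beta+1}, \qquad
e(1) = \frac{2}{3}, \qquad
e(2) = \frac{3}{5}, \qquad
\lim_{\beta\to\infty} e(\beta) = \frac{1}{2}.
\]
At $\beta=1$ the exponent equals the $T^{2/3}$ rate of the Lipschitz regime, and at $\beta=2$ it equals the $T^{3/5}$ upper bound. Since $3/5 < 2/3$, this gap is the separation between the smooth and non-smooth regimes.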