In many applications of online decision making, the environment is non-stationary, and it is therefore crucial to use bandit algorithms that handle changes. Most existing approaches are designed to protect against non-smooth changes, constrained only by total variation or Lipschitzness over time, for which they guarantee $\tilde \Theta(T^{2/3})$ regret. In practice, however, environments often change {\bf smoothly}, so such algorithms may incur higher-than-necessary regret in these settings and do not leverage information on the rate of change. We study a non-stationary two-armed bandit problem in which each arm's mean reward is assumed to be a $\beta$-H\"older function over (normalized) time, meaning it is $(\beta-1)$-times Lipschitz-continuously differentiable. We show the first separation between the smooth and non-smooth regimes by presenting a policy with $\tilde O(T^{3/5})$ regret for $\beta=2$. We complement this result with an $\Omega(T^{(\beta+1)/(2\beta+1)})$ lower bound for any integer $\beta\ge 1$, which matches our upper bound for $\beta=2$.
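As a quick sanity check on the exponents stated above (a worked substitution only, not part of the proofs), plugging $\beta=1$ and $\beta=2$ into the lower-bound exponent gives
\[
\left.\frac{\beta+1}{2\beta+1}\right|_{\beta=1} = \frac{2}{3},
\qquad
\left.\frac{\beta+1}{2\beta+1}\right|_{\beta=2} = \frac{3}{5}.
\]
So $\beta=1$ recovers the $T^{2/3}$ rate of the Lipschitz regime, while $\beta=2$ matches the $\tilde O(T^{3/5})$ upper bound up to logarithmic factors. The expression itself decreases toward $1/2$ as $\beta$ grows; the lower bound is stated only for integer $\beta\ge 1$, and a matching upper bound is claimed only for $\beta=2$.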