We consider the stochastic multi-armed bandit problem with non-stationary rewards. We present a novel formulation of non-stationarity in the environment where the arms' mean rewards change over time according to an unknown, latent, auto-regressive (AR) state of order $k$. We call this new environment the latent AR bandit. Different forms of the latent AR bandit appear in many real-world settings, especially in emerging scientific fields such as behavioral health or education where there are few mechanistic models of the environment. When the AR order $k$ is known, we propose an algorithm that achieves $\tilde{O}(k\sqrt{T})$ regret in this setting. Empirically, our algorithm outperforms standard UCB across multiple non-stationary environments, even when $k$ is misspecified.
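To make the environment concrete, the following is a minimal simulation sketch of one plausible latent AR($k$) bandit: a hidden scalar state evolves as an AR($k$) process and shifts every arm's mean reward. The coefficient ranges, noise scales, and the additive way the latent state enters the arm means are all illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def simulate_latent_ar_bandit(n_arms=3, k=2, T=100, seed=0):
    """Illustrative latent AR(k) bandit environment (assumed model, not
    necessarily the paper's): a hidden state z_t follows an AR(k) process
    and is added to each arm's baseline mean, making rewards non-stationary."""
    rng = np.random.default_rng(seed)
    gamma = rng.uniform(-0.4, 0.4, size=k)      # hypothetical AR coefficients
    base = rng.uniform(0.0, 1.0, size=n_arms)   # per-arm baseline means
    z = np.zeros(T + k)                         # latent state, padded history
    rewards = np.zeros((T, n_arms))
    for t in range(T):
        # AR(k) update: new state depends on the previous k latent states
        z[t + k] = gamma @ z[t:t + k][::-1] + rng.normal(scale=0.1)
        means = base + z[t + k]                 # latent state shifts all arms
        rewards[t] = means + rng.normal(scale=0.1, size=n_arms)
    return rewards

R = simulate_latent_ar_bandit()
print(R.shape)  # one reward draw per arm per round
```

Under this kind of model, an algorithm that regresses observed rewards on the last $k$ reward observations could, in principle, track the latent state; the paper's algorithm and its $\tilde{O}(k\sqrt{T})$ guarantee presumably formalize such an idea.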