We consider the stochastic multi-armed bandit problem with non-stationary rewards. We present a novel formulation of non-stationarity in which changes in the arms' mean rewards over time are driven by an unknown, latent, auto-regressive (AR) state of order $k$. We call this new environment the latent AR bandit. Variants of the latent AR bandit arise in many real-world settings, especially in emerging scientific fields such as behavioral health and education, where few mechanistic models of the environment exist. For the case where the AR order $k$ is known, we propose an algorithm that achieves $\tilde{O}(k\sqrt{T})$ regret. Empirically, our algorithm outperforms standard UCB across multiple non-stationary environments, even when $k$ is misspecified.
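To make the setting concrete, the following is a minimal simulation sketch of a latent AR environment with a standard UCB baseline running on it. It is not the proposed algorithm. Several details are assumptions made only for illustration, since the abstract does not specify them: the linear reward model `mu[a] + beta[a] * z[t]`, the Gaussian noise scales `sigma_z` and `sigma_r`, the AR coefficients `gamma`, and the use of dynamic regret against the best arm at each step. The function names `simulate_latent_ar` and `run_ucb` are hypothetical.

```python
import numpy as np

def simulate_latent_ar(T, gamma, sigma_z=0.1, seed=0):
    """Simulate a latent AR(k) state: z_t = sum_j gamma[j] * z_{t-1-j} + noise."""
    rng = np.random.default_rng(seed)
    gamma = np.asarray(gamma, dtype=float)
    k = len(gamma)
    z = np.zeros(T + k)  # the first k entries are an all-zero initial history
    for t in range(k, T + k):
        # Reverse the window so it is ordered z_{t-1}, ..., z_{t-k}
        z[t] = np.dot(gamma, z[t - k:t][::-1]) + sigma_z * rng.normal()
    return z[k:]

def run_ucb(T, mu, beta, z, sigma_r=0.1, seed=1):
    """Run standard UCB1 on rewards r_t(a) = mu[a] + beta[a] * z[t] + noise.

    The linear dependence on z[t] is an assumed reward model.
    Returns cumulative dynamic regret against the best arm at each step.
    """
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    beta = np.asarray(beta, dtype=float)
    n_arms = len(mu)
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    regret = np.zeros(T)
    for t in range(T):
        means = mu + beta * z[t]  # mean rewards drift with the latent state
        if t < n_arms:
            arm = t  # pull each arm once to initialize the estimates
        else:
            bonus = np.sqrt(2.0 * np.log(t + 1) / counts)
            arm = int(np.argmax(sums / counts + bonus))
        reward = means[arm] + sigma_r * rng.normal()
        counts[arm] += 1
        sums[arm] += reward
        regret[t] = means.max() - means[arm]
    return regret.cumsum()

# Example: an AR(2) latent state and two arms that respond to it in opposite directions
T = 1000
z = simulate_latent_ar(T, gamma=[0.5, 0.3])
cum_regret = run_ucb(T, mu=[0.0, 0.1], beta=[1.0, -1.0], z=z)
```

Because UCB averages all past rewards for an arm, it cannot track which arm is currently best as `z` drifts. That is the kind of failure mode a method exploiting the AR structure aims to avoid.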