Thompson sampling has proven effective across a wide range of stationary bandit environments. However, as we demonstrate in this paper, it can perform poorly when applied to non-stationary environments. We show that such failures arise because, when exploring, the algorithm does not differentiate actions based on how quickly the information acquired loses its usefulness due to non-stationarity. Building upon this insight, we propose predictive sampling, an algorithm that deprioritizes acquiring information that quickly loses usefulness. We establish a theoretical guarantee on the performance of predictive sampling through a Bayesian regret bound. We provide versions of predictive sampling whose computations scale tractably to complex bandit environments of practical interest. Through numerical simulations, we demonstrate that predictive sampling outperforms Thompson sampling in all non-stationary environments examined.
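For readers unfamiliar with the baseline, the following is a minimal sketch of standard Beta-Bernoulli Thompson sampling in a stationary environment. It is not the predictive sampling algorithm proposed in the paper, and it does not model non-stationarity. The function name `thompson_sampling`, the parameters `reward_probs`, `horizon`, and `seed`, and the uniform Beta(1, 1) prior are assumptions made for illustration only.

```python
import random


def thompson_sampling(reward_probs, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling on a stationary bandit (illustrative).

    reward_probs: hypothetical per-arm success probabilities, assumed fixed.
    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    successes = [1] * n_arms  # alpha parameters of a Beta(1, 1) prior (assumed)
    failures = [1] * n_arms   # beta parameters of the same prior
    pulls = [0] * n_arms

    for _ in range(horizon):
        # Draw one plausible mean reward per arm from its current posterior.
        samples = [rng.betavariate(successes[a], failures[a]) for a in range(n_arms)]
        # Act greedily with respect to the sampled means.
        arm = max(range(n_arms), key=lambda a: samples[a])
        # Observe a Bernoulli reward and update that arm's posterior counts.
        reward = 1 if rng.random() < reward_probs[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1

    return pulls


pulls = thompson_sampling([0.2, 0.8], horizon=1000)
```

In this sketch every observation is folded into the posterior with equal weight, regardless of how old it is, which is reasonable only because the reward probabilities are assumed not to change over time.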