Non-stationary multi-armed bandit (NS-MAB) problems have recently received significant attention. NS-MAB problems are typically modelled in two settings: abruptly changing, where reward distributions remain constant for a period and change at unknown time steps, and smoothly changing, where reward distributions evolve smoothly according to unknown dynamics. In this paper, we propose Discounted Thompson Sampling (DS-TS) with Gaussian priors to address both non-stationary settings. Our algorithm adapts to changes passively by incorporating a discount factor into Thompson Sampling. The DS-TS method has been validated experimentally, but an analysis of its regret upper bound has so far been lacking. Under mild assumptions, we show that DS-TS with Gaussian priors achieves a nearly optimal regret bound of order $\tilde{O}(\sqrt{TB_T})$ in the abruptly changing setting and $\tilde{O}(T^{\beta})$ in the smoothly changing setting, where $T$ is the number of time steps, $B_T$ is the number of breakpoints, $\beta$ is a parameter associated with the smoothly changing environment, and $\tilde{O}$ hides logarithmic terms and parameters independent of $T$. Furthermore, empirical comparisons between DS-TS and other non-stationary bandit algorithms demonstrate its competitive performance. In particular, when prior knowledge of the maximum expected reward is available, DS-TS has the potential to outperform state-of-the-art algorithms.
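To make the idea of passive adaptation through discounting concrete, the following is a minimal sketch of Thompson Sampling with discounted Gaussian statistics. It is an illustration under stated assumptions, not necessarily the exact algorithm analysed in the paper: the class name `DiscountedGaussianTS`, the choice of sampling variance `1 / N` (with `N` the discounted pull count), the forced initial pull of each arm, the value `gamma=0.95`, and the two-armed Bernoulli environment with a single breakpoint are all hypothetical choices made for this example.

```python
import random


class DiscountedGaussianTS:
    """Sketch of Thompson Sampling with discounted Gaussian statistics.

    Assumption (not taken from the abstract): each arm's sample is drawn
    from N(discounted mean, 1 / discounted count).
    """

    def __init__(self, n_arms, gamma=0.95, seed=None):
        if not 0.0 < gamma <= 1.0:
            raise ValueError("gamma must lie in (0, 1]")
        self.gamma = gamma
        self.counts = [0.0] * n_arms  # discounted pull counts
        self.sums = [0.0] * n_arms    # discounted reward sums
        self.rng = random.Random(seed)

    def select_arm(self):
        # Play any arm whose discounted count is (numerically) zero first,
        # which also avoids dividing by zero below.
        for arm, n in enumerate(self.counts):
            if n < 1e-12:
                return arm
        # Draw one Gaussian sample per arm and play the largest.
        samples = [
            self.rng.gauss(s / n, (1.0 / n) ** 0.5)
            for s, n in zip(self.sums, self.counts)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        # Discount the statistics of every arm, then add the new observation.
        for k in range(len(self.counts)):
            self.counts[k] *= self.gamma
            self.sums[k] *= self.gamma
        self.counts[arm] += 1.0
        self.sums[arm] += reward


def run_abrupt_demo(horizon=2000, breakpoint=1000, seed=0):
    """Two Bernoulli arms whose means swap at `breakpoint` (hypothetical setup).

    Returns how often each arm was pulled after the change.
    """
    env_rng = random.Random(seed)
    agent = DiscountedGaussianTS(n_arms=2, gamma=0.95, seed=seed + 1)
    pulls_after_change = [0, 0]
    for t in range(horizon):
        means = (0.9, 0.1) if t < breakpoint else (0.1, 0.9)
        arm = agent.select_arm()
        reward = 1.0 if env_rng.random() < means[arm] else 0.0
        agent.update(arm, reward)
        if t >= breakpoint:
            pulls_after_change[arm] += 1
    return pulls_after_change
```

Because old observations are down-weighted geometrically, an arm that has not been played for a while has a small discounted count and therefore a wide sampling distribution, so it is revisited without any explicit change detection. In this sketch a smaller `gamma` forgets faster but explores more; how to tune it against $T$ and $B_T$ is the subject of the regret analysis and is not addressed by the example.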