Non-stationary multi-armed bandit (NS-MAB) problems have recently received significant attention. NS-MAB problems are typically modelled in two scenarios: abruptly changing, where reward distributions remain constant for a certain period and change at unknown time steps, and smoothly changing, where reward distributions evolve smoothly according to unknown dynamics. In this paper, we propose Discounted Thompson Sampling (DS-TS) with Gaussian priors to address both non-stationary settings. Our algorithm passively adapts to changes by incorporating a discount factor into Thompson Sampling. The DS-TS method has been validated experimentally, but an analysis of its regret upper bound has been lacking. Under mild assumptions, we show that DS-TS with Gaussian priors can achieve a nearly optimal regret bound on the order of $\tilde{O}(\sqrt{TB_T})$ in the abruptly changing setting and $\tilde{O}(T^{\beta})$ in the smoothly changing setting, where $T$ is the number of time steps, $B_T$ is the number of breakpoints, $\beta$ is associated with the smoothly changing environment, and $\tilde{O}$ hides parameters independent of $T$ as well as logarithmic terms. Furthermore, empirical comparisons between DS-TS and other non-stationary bandit algorithms demonstrate its competitive performance. Specifically, when prior knowledge of the maximum expected reward is available, DS-TS has the potential to outperform state-of-the-art algorithms.
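To make the idea of adding a discount factor to Thompson Sampling concrete, the following is a minimal sketch and not the authors' exact algorithm. It assumes one common formulation: each arm keeps a discounted pull count and a discounted reward sum, both multiplied by $\gamma$ at every step, and the sample for an arm is drawn from a Gaussian centred on the discounted mean with variance equal to the inverse of the discounted count. The class name `DiscountedGaussianTS`, the function `run_abrupt_demo`, the default $\gamma = 0.95$, the reward means, and the Gaussian reward noise are all illustrative assumptions that do not come from the abstract.

```python
import random


class DiscountedGaussianTS:
    """Illustrative discounted Thompson Sampling with Gaussian samples.

    Assumption (not taken from the abstract): the sample for an arm is drawn
    from N(discounted mean, 1 / discounted count).
    """

    def __init__(self, n_arms, gamma=0.95, seed=None):
        if not 0.0 < gamma <= 1.0:
            raise ValueError("gamma must lie in (0, 1]")
        self.n_arms = n_arms
        self.gamma = gamma
        self.rng = random.Random(seed)
        self.counts = [0.0] * n_arms  # discounted number of pulls per arm
        self.sums = [0.0] * n_arms    # discounted sum of rewards per arm

    def select_arm(self):
        # Pull any arm that currently has no (discounted) observations.
        for arm in range(self.n_arms):
            if self.counts[arm] == 0.0:
                return arm
        best_arm, best_sample = 0, float("-inf")
        for arm in range(self.n_arms):
            mean = self.sums[arm] / self.counts[arm]
            std = (1.0 / self.counts[arm]) ** 0.5
            sample = self.rng.gauss(mean, std)
            if sample > best_sample:
                best_arm, best_sample = arm, sample
        return best_arm

    def update(self, arm, reward):
        # Discount all past statistics, then add the new observation.
        for a in range(self.n_arms):
            self.counts[a] *= self.gamma
            self.sums[a] *= self.gamma
        self.counts[arm] += 1.0
        self.sums[arm] += reward


def run_abrupt_demo(horizon=1000, breakpoint=500, gamma=0.95, seed=0):
    """Two-armed toy environment whose arm means swap once at `breakpoint`."""
    env_rng = random.Random(seed)
    agent = DiscountedGaussianTS(n_arms=2, gamma=gamma, seed=seed + 1)
    choices = []
    for t in range(horizon):
        # Hypothetical means: arm 0 is best before the breakpoint, arm 1 after.
        means = (0.9, 0.1) if t < breakpoint else (0.1, 0.9)
        arm = agent.select_arm()
        reward = env_rng.gauss(means[arm], 0.1)
        agent.update(arm, reward)
        choices.append(arm)
    return choices


if __name__ == "__main__":
    picks = run_abrupt_demo()
    late = picks[-200:]
    print("share of arm 1 in the last 200 steps:", sum(late) / len(late))
```

Because old observations decay geometrically, the discounted count of a neglected arm shrinks, its sampling variance grows, and the arm is eventually tried again. This is the passive adaptation the abstract describes: no explicit change detection is needed. In this sketch, a smaller $\gamma$ forgets faster but explores more, so the right value depends on how quickly the environment is expected to change.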