Non-stationary parametric bandits have recently attracted much attention. There are three principled strategies for handling non-stationarity: sliding-window, weighted, and restart. As many non-stationary environments exhibit gradual drifting patterns, the weighted strategy is commonly adopted in real-world applications. However, previous theoretical studies show that its analysis is more involved and that the resulting algorithms are either computationally less efficient or statistically suboptimal. This paper revisits the weighted strategy for non-stationary parametric bandits. In linear bandits (LB), we discover that this undesirable feature stems from an inadequate regret analysis, which in turn leads to an overly complex algorithm design. We propose a \emph{refined analysis framework} that simplifies the derivation and, importantly, yields a simpler weight-based algorithm that is as efficient as window/restart-based algorithms while retaining the same regret guarantee as previous studies. Furthermore, our new framework can be used to improve the regret bounds of other parametric bandits, including Generalized Linear Bandits (GLB) and Self-Concordant Bandits (SCB). For example, we develop a simple weighted GLB algorithm with an $\tilde{O}(k_\mu^{5/4} c_\mu^{-3/4} d^{3/4} P_T^{1/4} T^{3/4})$ regret, improving upon the $\tilde{O}(k_\mu^{2} c_\mu^{-1} d^{9/10} P_T^{1/5} T^{4/5})$ bound in prior work, where $k_\mu$ and $c_\mu$ characterize the nonlinearity of the reward model, $P_T$ measures the non-stationarity, and $d$ and $T$ denote the dimension and time horizon, respectively. Moreover, we extend our framework to non-stationary Markov Decision Processes (MDPs) with function approximation, focusing on Linear Mixture MDPs and Multinomial Logit (MNL) Mixture MDPs. For both classes, we propose algorithms based on the weighted strategy and establish dynamic regret guarantees using our analysis framework.
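To make the weighted strategy concrete, the following is a minimal sketch of the exponentially discounted ridge estimator that weight-based linear bandit algorithms typically build on; the discount factor $\gamma$ and regularizer $\lambda$ here are illustrative assumptions rather than this paper's exact construction:
\[
\hat{\theta}_t = \Big(\sum_{s=1}^{t} \gamma^{\,t-s} X_s X_s^\top + \lambda I_d\Big)^{-1} \sum_{s=1}^{t} \gamma^{\,t-s} X_s r_s, \qquad \gamma \in (0,1),\ \lambda > 0,
\]
where $X_s$ and $r_s$ denote the action and reward observed at round $s$. Past observations are down-weighted geometrically, so the estimator tracks a drifting parameter rather than averaging over the entire history.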