Revisiting Weighted Strategy for Non-stationary Parametric Bandits

Non-stationary parametric bandits have attracted much attention recently. There are three principled ways to deal with non-stationarity: the sliding-window, weighted, and restart strategies. As many non-stationary environments exhibit gradual drifting patterns, the weighted strategy is commonly adopted in real-world applications. However, previous theoretical studies show that its analysis is more involved, and the resulting algorithms are either computationally less efficient or statistically suboptimal. This paper revisits the weighted strategy for non-stationary parametric bandits. In linear bandits (LB), we find that these drawbacks stem from an inadequate regret analysis, which leads to an overly complex algorithm design. We propose a refined analysis framework that simplifies the derivation and, importantly, produces a simpler weight-based algorithm that is as efficient as window/restart-based algorithms while retaining the same regret bound as previous studies. Furthermore, our new framework can be used to improve the regret bounds of other parametric bandits, including Generalized Linear Bandits (GLB) and Self-Concordant Bandits (SCB). For example, we develop a simple weighted GLB algorithm with an $\widetilde{O}(k_\mu^{\frac{5}{4}} c_\mu^{-\frac{3}{4}} d^{\frac{3}{4}} P_T^{\frac{1}{4}}T^{\frac{3}{4}})$ regret, improving the $\widetilde{O}(k_\mu^{2} c_\mu^{-1}d^{\frac{9}{10}} P_T^{\frac{1}{5}}T^{\frac{4}{5}})$ bound in prior work, where $k_\mu$ and $c_\mu$ characterize the reward model's nonlinearity, $P_T$ measures the non-stationarity, and $d$ and $T$ denote the dimension and the time horizon, respectively.
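
To see in which regime the new GLB bound improves on the prior one, it may help to take the ratio of the two rates. The following is only a back-of-the-envelope comparison that ignores constants and logarithmic factors hidden in $\widetilde{O}(\cdot)$; the conditions $P_T \le T$ and $c_\mu \le k_\mu^{3}$ are assumptions made here for illustration and are not stated in the abstract.

$$
\frac{k_\mu^{\frac{5}{4}} c_\mu^{-\frac{3}{4}} d^{\frac{3}{4}} P_T^{\frac{1}{4}} T^{\frac{3}{4}}}{k_\mu^{2} c_\mu^{-1} d^{\frac{9}{10}} P_T^{\frac{1}{5}} T^{\frac{4}{5}}}
= k_\mu^{-\frac{3}{4}} c_\mu^{\frac{1}{4}} d^{-\frac{3}{20}} P_T^{\frac{1}{20}} T^{-\frac{1}{20}}
= \left(\frac{c_\mu}{k_\mu^{3}}\right)^{\frac{1}{4}} d^{-\frac{3}{20}} \left(\frac{P_T}{T}\right)^{\frac{1}{20}}.
$$

Under the stated assumptions each of the three factors is at most $1$, so the new rate is no larger than the old one. The exponent of $P_T$ alone is slightly larger in the new bound ($\frac{1}{4}$ versus $\frac{1}{5}$), but this is offset by the smaller exponent of $T$ whenever $P_T \le T$.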
