Non-stationary parametric bandits have attracted much attention recently. There are three principled strategies for dealing with non-stationarity: sliding-window, weighted, and restart. Because many non-stationary environments drift gradually, the weighted strategy is commonly adopted in real-world applications. However, previous theoretical studies show that its analysis is more involved and that the resulting algorithms are either computationally less efficient or statistically suboptimal. This paper revisits the weighted strategy for non-stationary parametric bandits. In linear bandits (LB), we find that this shortcoming stems from an inadequate regret analysis, which leads to an overly complex algorithm design. We propose a refined analysis framework that simplifies the derivation and, importantly, yields a simpler weight-based algorithm that is as efficient as window/restart-based algorithms while retaining the same regret as previous studies. Furthermore, our new framework can be used to improve the regret bounds of other parametric bandits, including Generalized Linear Bandits (GLB) and Self-Concordant Bandits (SCB). For example, we develop a simple weighted GLB algorithm with an $\widetilde{O}(k_\mu^{\frac{5}{4}} c_\mu^{-\frac{3}{4}} d^{\frac{3}{4}} P_T^{\frac{1}{4}}T^{\frac{3}{4}})$ regret bound, improving the $\widetilde{O}(k_\mu^{2} c_\mu^{-1}d^{\frac{9}{10}} P_T^{\frac{1}{5}}T^{\frac{4}{5}})$ bound in prior work, where $k_\mu$ and $c_\mu$ characterize the nonlinearity of the reward model, $P_T$ measures the non-stationarity, and $d$ and $T$ denote the dimension and time horizon, respectively.
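As an illustration of the weighted strategy, here is a minimal Python sketch of exponentially discounted ridge regression, a common way to implement weighting in linear models. It is not the algorithm proposed in this paper: the class name `DiscountedRidge`, the discount factor `gamma`, the regularizer `lam`, and the drift simulation in `run_demo` are all assumptions made for illustration, and the confidence-bound and arm-selection steps of a bandit algorithm are omitted.

```python
import numpy as np


class DiscountedRidge:
    """Exponentially weighted ridge regression (illustrative; hypothetical name)."""

    def __init__(self, dim, gamma=0.95, lam=1.0):
        self.gamma = gamma          # discount factor in (0, 1]; 1.0 means no forgetting
        self.lam = lam              # ridge regularization strength
        self.V = lam * np.eye(dim)  # weighted Gram matrix plus lam * I
        self.b = np.zeros(dim)      # weighted sum of reward-scaled features

    def update(self, x, r):
        # Discount past data by gamma, add the new sample, and top up the
        # regularizer so that V stays equal to (weighted Gram matrix) + lam * I.
        dim = x.shape[0]
        self.V = (self.gamma * self.V + np.outer(x, x)
                  + (1.0 - self.gamma) * self.lam * np.eye(dim))
        self.b = self.gamma * self.b + r * x

    def estimate(self):
        # Solve V theta = b for the weighted ridge estimate.
        return np.linalg.solve(self.V, self.b)


def run_demo(horizon=2000, noise=0.1, seed=0):
    """Track a parameter that drifts linearly from `start` to `end` (assumed setup)."""
    rng = np.random.default_rng(seed)
    start, end = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    weighted = DiscountedRidge(dim=2, gamma=0.95)
    unweighted = DiscountedRidge(dim=2, gamma=1.0)  # ordinary ridge regression
    for t in range(horizon):
        theta = start + (end - start) * t / (horizon - 1)  # gradual drift
        x = rng.normal(size=2)                             # random feature vector
        r = x @ theta + noise * rng.normal()               # noisy linear reward
        weighted.update(x, r)
        unweighted.update(x, r)
    # Estimation error with respect to the final parameter value.
    err_w = np.linalg.norm(weighted.estimate() - end)
    err_u = np.linalg.norm(unweighted.estimate() - end)
    return err_w, err_u


err_weighted, err_unweighted = run_demo()
print(f"weighted error: {err_weighted:.3f}, unweighted error: {err_unweighted:.3f}")
```

A discount factor closer to 1 averages over a longer history, so the choice of `gamma` trades off noise reduction against how quickly the estimate follows the drift.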