The stochastic multi-armed bandit (MAB) problem is one of the most fundamental models in sequential decision-making, with the core challenge being the trade-off between exploration and exploitation. Although algorithms such as the Upper Confidence Bound (UCB) and Thompson Sampling, together with their regret analyses, are well established, existing analyses operate primarily from a time-domain, cumulative-regret perspective and struggle to characterize the dynamics of the learning process. This paper proposes a novel frequency-domain analysis framework that reformulates the bandit process as a signal processing problem. Within this framework, the reward estimate of each arm is viewed as a spectral component whose uncertainty corresponds to the component's frequency, and the bandit algorithm is interpreted as an adaptive filter. We construct a formal Frequency-Domain Bandit Model and prove the main theorem: the confidence bound term of the UCB algorithm is equivalent, in the frequency domain, to a time-varying gain applied to uncertain spectral components, a gain inversely proportional to the square root of the visit count. Building on this result, we derive finite-time dynamic bounds on the decay of the exploration rate. The theory not only provides a novel and intuitive physical interpretation of classical algorithms but also lays a rigorous theoretical foundation for designing next-generation algorithms with adaptive parameter adjustment.
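As a minimal sketch of the gain correspondence stated in the main theorem, assuming the standard UCB1 index (the exact normalization used in the paper's Frequency-Domain Bandit Model is not specified in this abstract):
\[
\mathrm{UCB}_i(t) \;=\; \hat{\mu}_i(t) \;+\; \underbrace{\sqrt{\frac{2\ln t}{N_i(t)}}}_{g_i(t)},
\qquad g_i(t) \;\propto\; \frac{1}{\sqrt{N_i(t)}},
\]
where $\hat{\mu}_i(t)$ is the empirical mean reward of arm $i$ after $t$ rounds and $N_i(t)$ its visit count; in the frequency-domain reading, the bonus term $g_i(t)$ plays the role of the time-varying gain applied to arm $i$'s uncertain spectral component, shrinking as $N_i(t)^{-1/2}$ with repeated visits.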