In this paper, we study the behavior of the Upper Confidence Bound-Variance (UCB-V) algorithm for Multi-Armed Bandit (MAB) problems, a variant of the canonical Upper Confidence Bound (UCB) algorithm that incorporates variance estimates into its decision-making process. More precisely, we provide an asymptotic characterization of the arm-pulling rates of UCB-V, extending recent results for the canonical UCB in Kalvit and Zeevi (2021) and Khamaru and Zhang (2024). In an interesting contrast to the canonical UCB, we show that the behavior of UCB-V can exhibit instability, meaning that the arm-pulling rates may not always be asymptotically deterministic. Beyond the asymptotic characterization, we also provide non-asymptotic bounds for arm-pulling rates in the high-probability regime, offering insights into regret analysis. As an application of this high-probability result, we show that UCB-V can achieve a refined regret bound, previously unknown even for more complicated and advanced variance-aware online decision-making algorithms.
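To make the object of study concrete, the following is a minimal sketch of the UCB-V arm-selection rule: each arm's index is its empirical mean plus an exploration bonus built from the empirical variance. The constants follow the standard form of Audibert, Munos, and Szepesvári (2009); the helper names (`ucbv_index`, `run_ucbv`) and the specific constants are illustrative assumptions, not the exact tuning analyzed in this paper.

```python
import math
import random

def ucbv_index(mean, var, pulls, t, b=1.0):
    # UCB-V index: empirical mean plus a variance-aware exploration
    # bonus; b bounds the reward range. Constants follow the standard
    # form sqrt(2 V log t / s) + 3 b log t / s (an illustrative choice).
    bonus = math.sqrt(2.0 * var * math.log(t) / pulls) + 3.0 * b * math.log(t) / pulls
    return mean + bonus

def run_ucbv(arms, horizon, seed=0):
    # Pull each arm once, then always pull the arm with the largest
    # UCB-V index. `arms` is a list of callables mapping an RNG to a
    # reward in [0, 1]; returns the arm-pulling counts.
    rng = random.Random(seed)
    k = len(arms)
    pulls = [0] * k
    sums = [0.0] * k
    sq_sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            i = t - 1  # initialization round: one pull per arm
        else:
            def index(j):
                m = sums[j] / pulls[j]
                v = max(sq_sums[j] / pulls[j] - m * m, 0.0)  # empirical variance
                return ucbv_index(m, v, pulls[j], t)
            i = max(range(k), key=index)
        r = arms[i](rng)
        pulls[i] += 1
        sums[i] += r
        sq_sums[i] += r * r
    return pulls
```

Running this on, say, two Bernoulli arms with well-separated means shows the better arm accumulating the vast majority of pulls; the arm-pulling rates whose asymptotics this paper characterizes are exactly the normalized entries of `pulls`.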