UCB algorithms for multi-armed bandits: Precise regret and adaptive inference

Upper Confidence Bound (UCB) algorithms are a widely used class of sequential algorithms for the $K$-armed bandit problem. Despite decades of research on their asymptotic and (near) minimax optimality properties, a precise understanding of their regret behavior remains elusive. This gap has not only hindered the evaluation of their actual algorithmic efficiency, but has also limited further developments in statistical inference for sequentially collected data. This paper bridges these two fundamental aspects, precise regret analysis and adaptive statistical inference, through a deterministic characterization of the number of arm pulls for a UCB index algorithm [Lai87, Agr95, ACBF02]. The resulting precise regret formula not only accurately captures the actual behavior of the UCB algorithm for finite time horizons and individual problem instances, but also provides significant new insights into the regimes in which the existing theory remains informative. In particular, we show that the classical Lai-Robbins regret formula is exact if and only if the sub-optimality gaps exceed the order $\sigma\sqrt{K\log T/T}$. We also show that the maximal regret of the UCB algorithm deviates from the minimax regret by a logarithmic factor, thereby settling the question of its strict minimax optimality in the negative. The deterministic characterization of the number of arm pulls for the UCB algorithm also has major implications for adaptive statistical inference. Building on the seminal work of [Lai82], we show that the UCB algorithm satisfies certain stability properties that lead to quantitative central limit theorems in two settings, including the empirical means of unknown rewards in the bandit setting. These results have an important practical implication: conventional confidence sets designed for i.i.d. data remain valid even when data are collected sequentially.

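To make the quantities in the abstract concrete, the following is a minimal simulation sketch, not the paper's exact algorithm. It assumes Gaussian rewards with known standard deviation `sigma` and a textbook UCB index of the form "empirical mean plus `sigma * sqrt(2 log T / n_a)`". The abstract does not state the precise index, so that form, along with the hypothetical helper names `run_ucb`, `lai_robbins_regret`, and `gap_threshold`, is an assumption made for illustration. The Lai-Robbins value used here is the standard Gaussian leading term $\sum_{a:\Delta_a>0} 2\sigma^2\log T/\Delta_a$.

```python
import numpy as np


def run_ucb(means, horizon, sigma=1.0, seed=0):
    """Simulate an assumed textbook UCB index policy with Gaussian rewards.

    Returns the number of pulls per arm and the pseudo-regret
    sum_a gap_a * pulls_a. Requires horizon >= number of arms.
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    num_arms = len(means)
    pulls = np.zeros(num_arms, dtype=int)
    reward_sums = np.zeros(num_arms)

    for t in range(horizon):
        if t < num_arms:
            arm = t  # initialization: pull every arm once
        else:
            # Assumed index: empirical mean + sigma * sqrt(2 log T / n_a).
            bonus = sigma * np.sqrt(2.0 * np.log(horizon) / pulls)
            arm = int(np.argmax(reward_sums / pulls + bonus))
        reward = rng.normal(means[arm], sigma)
        pulls[arm] += 1
        reward_sums[arm] += reward

    gaps = means.max() - means
    return pulls, float(np.dot(gaps, pulls))


def lai_robbins_regret(means, horizon, sigma=1.0):
    """Classical Gaussian Lai-Robbins leading term: sum of 2 sigma^2 log T / gap."""
    means = np.asarray(means, dtype=float)
    gaps = means.max() - means
    positive_gaps = gaps[gaps > 0]
    return float(np.sum(2.0 * sigma**2 * np.log(horizon) / positive_gaps))


def gap_threshold(num_arms, horizon, sigma=1.0):
    """Order of the gap scale sigma * sqrt(K log T / T) named in the abstract."""
    return sigma * np.sqrt(num_arms * np.log(horizon) / horizon)


if __name__ == "__main__":
    arm_means = [1.0, 0.0]
    T = 2000
    pulls, regret = run_ucb(arm_means, T)
    print("pulls per arm:", pulls)
    print("simulated pseudo-regret:", regret)
    print("Lai-Robbins leading term:", lai_robbins_regret(arm_means, T))
    print("gap threshold order:", gap_threshold(len(arm_means), T))
```

Comparing the simulated pseudo-regret with the Lai-Robbins term for gaps well above and well below `gap_threshold` gives a rough, informal feel for the regime distinction the abstract describes. A single simulated run is noisy and does not verify the paper's results.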