Recent studies have shown that reinforcement learning with KL-regularized objectives can enjoy faster rates of convergence or logarithmic regret, in contrast to the classical $\sqrt{T}$-type regret in the unregularized setting. However, the statistical efficiency of online learning with respect to KL-regularized objectives remains far from completely characterized, even when specialized to multi-armed bandits (MABs). We address this problem for MABs via a sharp analysis of KL-UCB using a novel peeling argument, which yields a $\tilde{O}(ηK\log^2 T)$ upper bound: the first high-probability regret bound with linear dependence on $K$. Here, $T$ is the time horizon, $K$ is the number of arms, $η^{-1}$ is the regularization intensity, and $\tilde{O}$ hides all logarithmic factors except those involving $\log T$. The near-tightness of our analysis is certified by the first non-constant lower bound $Ω(ηK \log T)$, which follows from subtle hard-instance constructions and a tailored decomposition of the Bayes prior. Moreover, in the low-regularization regime (i.e., large $η$), we show that the KL-regularized regret for MABs is $η$-independent and scales as $\tilde{Θ}(\sqrt{KT})$. Overall, our results provide a thorough understanding of KL-regularized MABs across all regimes of $η$ and yield nearly optimal bounds in terms of $K$, $η$, and $T$.
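For orientation, the following is a sketch of one standard way to write a KL-regularized bandit objective. The abstract does not state its exact definitions, so the notation here is an assumption: $r(a)$ for the mean reward of arm $a$, $\pi_{\mathrm{ref}}$ for a reference policy, and $J_\eta$ for the regularized value. Under these assumptions the optimal policy has a closed Gibbs form, and the suboptimality of any policy $\pi$ equals a scaled KL divergence to it.

```latex
% Assumed standard formulation; the abstract itself does not spell it out.
% r(a): mean reward of arm a; \pi_{\mathrm{ref}}: reference policy (illustrative notation).
J_\eta(\pi) = \sum_{a=1}^{K} \pi(a)\, r(a) - \frac{1}{\eta}\, \mathrm{KL}\!\left(\pi \,\|\, \pi_{\mathrm{ref}}\right),
\qquad
\pi^\star_\eta(a) = \frac{\pi_{\mathrm{ref}}(a)\, e^{\eta r(a)}}{\sum_{b=1}^{K} \pi_{\mathrm{ref}}(b)\, e^{\eta r(b)}},
\qquad
J_\eta(\pi^\star_\eta) - J_\eta(\pi) = \frac{1}{\eta}\, \mathrm{KL}\!\left(\pi \,\|\, \pi^\star_\eta\right).
```

In this formulation the penalty weight is $η^{-1}$, so a large $η$ means weak regularization and $\pi^\star_\eta$ concentrates on the best arm, which matches the abstract's description of the large-$η$ regime as the low-regularization one.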