Convergence analysis of online learning algorithms is central to machine learning theory. Last-iterate convergence is particularly important because it captures the learner's actual decisions and describes how the learning process evolves over time. In multi-armed bandits, however, most existing analyses focus on the order of the regret, while the last-iterate (simple regret) convergence rate remains less explored, especially for the widely studied Follow-the-Regularized-Leader (FTRL) algorithms. Recently, FTRL with the $1/2$-Tsallis entropy regularizer $\Psi(p) = -4\sum_{i=1}^d \sqrt{p_i}$ (the $1/2$-Tsallis-INF algorithm, arXiv:1807.07623) was shown to achieve logarithmic regret in stochastic bandits, but its last-iterate convergence rate has not yet been studied. Intuitively, logarithmic regret should correspond to a $t^{-1}$ last-iterate convergence rate. This paper studies the $1/2$-Tsallis-INF algorithm and partially confirms this intuition through theoretical analysis: the Bregman divergence induced by $\Psi$, between the point mass on the optimal arm and the probability distribution over the arm set obtained at iteration $t$, decays at a rate of $t^{-1/2}$.
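To make the quantity in the last sentence concrete, the following is a minimal sketch, not code from the paper, of the Bregman divergence between the point mass on one arm and a distribution $p$ under the regularizer $\Psi(p) = -4\sum_i \sqrt{p_i}$. The function names and the argument `best_arm` are illustrative assumptions, and every $p_i$ is assumed strictly positive so the gradient $-2/\sqrt{p_i}$ is finite. Expanding the standard definition $D_\Psi(q, p) = \Psi(q) - \Psi(p) - \langle \nabla\Psi(p), q - p \rangle$ with $q$ a point mass gives the closed form $2\sum_i \sqrt{p_i} + 2/\sqrt{p_{i^*}} - 4$, which the second function evaluates directly.

```python
import math


def tsallis_half(p):
    """Regularizer from the abstract: Psi(p) = -4 * sum_i sqrt(p_i)."""
    return -4.0 * sum(math.sqrt(x) for x in p)


def bregman_to_point_mass(p, best_arm):
    """D_Psi(e_best, p) from the definition of a Bregman divergence.

    Assumes every p_i > 0 (hypothetical helper, not the paper's code).
    """
    # Point mass on the arm assumed to be optimal.
    q = [1.0 if i == best_arm else 0.0 for i in range(len(p))]
    # Gradient of Psi at p: d Psi / d p_i = -2 / sqrt(p_i).
    grad = [-2.0 / math.sqrt(x) for x in p]
    inner = sum(g * (qi - pi) for g, qi, pi in zip(grad, q, p))
    return tsallis_half(q) - tsallis_half(p) - inner


def bregman_closed_form(p, best_arm):
    """Same quantity after simplifying the definition by hand."""
    return 2.0 * sum(math.sqrt(x) for x in p) + 2.0 / math.sqrt(p[best_arm]) - 4.0


uniform = [0.25, 0.25, 0.25, 0.25]
concentrated = [0.98, 0.01, 0.01]
print(bregman_to_point_mass(uniform, 0))
print(bregman_to_point_mass(concentrated, 0))
```

The divergence is zero only when $p$ is itself the point mass on the chosen arm, so it shrinks as the sampling distribution concentrates on that arm; this is the quantity whose $t^{-1/2}$ decay the abstract reports.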