We study the problem of designing adaptive multi-armed bandit algorithms that perform optimally in both the stochastic setting and the adversarial setting simultaneously (often known as a best-of-both-worlds guarantee). A recent line of work shows that, when configured and analyzed properly, the Follow-the-Regularized-Leader (FTRL) algorithm, originally designed for the adversarial setting, can in fact optimally adapt to the stochastic setting as well. Such results, however, critically rely on the assumption that there exists a unique optimal arm. Recently, Ito (2021) took the first step toward removing this undesirable uniqueness assumption for one particular FTRL algorithm with the $\frac{1}{2}$-Tsallis entropy regularizer. In this work, we significantly improve and generalize this result, showing that uniqueness is unnecessary for FTRL with a broad family of regularizers and a new learning rate schedule. For some regularizers, our regret bounds also improve upon prior results even when uniqueness holds. We further provide an application of our results to the decoupled exploration and exploitation problem, demonstrating that our techniques are broadly applicable.
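For readers unfamiliar with the baseline, the following is a minimal sketch of FTRL with the $\frac{1}{2}$-Tsallis entropy regularizer on a bandit instance with two tied optimal arms. It is an illustration under stated assumptions, not the algorithm analyzed in this work: the learning rate $\eta_t = 1/\sqrt{t}$, the importance-weighted loss estimator, the Bernoulli losses, and the names `tsallis_probs` and `run_ftrl` are all assumptions made for the example, and the new learning rate schedule mentioned above is not reproduced here.

```python
import numpy as np

def tsallis_probs(cum_loss, eta, iters=60):
    """FTRL step for the 1/2-Tsallis regularizer (illustrative sketch).

    The minimizer has the form p_i = 4 / (eta * (L_i - x))**2, where the
    normalizer x < min(L) is chosen so that the p_i sum to 1. We find x
    by bisection, since the sum is increasing in x.
    """
    L = np.asarray(cum_loss, dtype=float)
    K = L.size
    lo = L.min() - 2.0 * np.sqrt(K) / eta  # here every p_i <= 1/K, so sum <= 1
    hi = L.min() - 2.0 / eta               # here the best arm has p = 1, so sum >= 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        total = np.sum(4.0 / (eta * (L - mid)) ** 2)
        if total > 1.0:
            hi = mid
        else:
            lo = mid
    p = 4.0 / (eta * (L - 0.5 * (lo + hi))) ** 2
    return p / p.sum()  # remove residual bisection error

def run_ftrl(means, T, seed=0):
    """Run the sketch on Bernoulli losses with the given mean losses."""
    rng = np.random.default_rng(seed)
    K = len(means)
    cum_loss = np.zeros(K)          # importance-weighted cumulative loss estimates
    pulls = np.zeros(K, dtype=int)
    for t in range(1, T + 1):
        eta = 1.0 / np.sqrt(t)      # assumed schedule, not the paper's
        p = tsallis_probs(cum_loss, eta)
        arm = rng.choice(K, p=p)
        loss = float(rng.random() < means[arm])
        cum_loss[arm] += loss / p[arm]  # unbiased estimate of the loss vector
        pulls[arm] += 1
    return pulls

if __name__ == "__main__":
    # Two arms share the smallest mean loss, so the optimal arm is not unique.
    print(run_ftrl([0.2, 0.2, 0.8], T=2000))
```

The instance in the usage example deliberately has two arms with the same smallest mean loss, which is the non-unique case the abstract refers to.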