We study the problem of designing adaptive multi-armed bandit algorithms that perform optimally in both the stochastic setting and the adversarial setting simultaneously (often known as a best-of-both-worlds guarantee). A recent line of work shows that, when configured and analyzed properly, the Follow-the-Regularized-Leader (FTRL) algorithm, originally designed for the adversarial setting, can in fact adapt optimally to the stochastic setting as well. Such results, however, rely critically on the assumption that the optimal arm is unique. Recently, Ito (2021) took the first step toward removing this undesirable uniqueness assumption for one particular FTRL algorithm with the $\frac{1}{2}$-Tsallis entropy regularizer. In this work, we significantly improve and generalize this result, showing that uniqueness is unnecessary for FTRL with a broad family of regularizers and a new learning rate schedule. For some regularizers, our regret bounds also improve upon prior results even when uniqueness holds. We further provide an application of our results to the decoupled exploration and exploitation problem, demonstrating that our techniques are broadly applicable.
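For readers unfamiliar with FTRL under the $\frac{1}{2}$-Tsallis entropy regularizer, the following is a minimal sketch of the standard form of that algorithm, not the new learning rate schedule proposed in this work. It assumes Bernoulli losses, importance-weighted loss estimates, and an illustrative learning rate $\eta_t = 2/\sqrt{t}$; the names `tsallis_probs` and `run_bandit` are hypothetical. The example instance has two optimal arms, which is the case where uniqueness fails.

```python
import math
import random


def tsallis_probs(cum_losses, eta, iters=100):
    """One FTRL step with the 1/2-Tsallis regularizer.

    The minimizer over the simplex has the form
    p_i = 4 / (eta * (L_i - x))**2, where x < min(L) is the
    normalization constant that makes the probabilities sum to 1.
    We find x by bisection.
    """
    k = len(cum_losses)
    min_loss = min(cum_losses)
    lo = min_loss - 2.0 * math.sqrt(k) / eta  # here sum(p) <= 1
    hi = min_loss - 2.0 / eta                 # here sum(p) >= 1
    for _ in range(iters):
        x = 0.5 * (lo + hi)
        total = sum(4.0 / (eta * (l - x)) ** 2 for l in cum_losses)
        if total > 1.0:
            hi = x
        else:
            lo = x
    x = 0.5 * (lo + hi)
    p = [4.0 / (eta * (l - x)) ** 2 for l in cum_losses]
    s = sum(p)
    return [v / s for v in p]  # renormalize to absorb rounding error


def run_bandit(means, horizon, seed=0):
    """Play a Bernoulli bandit (assumed setup) and return pull counts."""
    rng = random.Random(seed)
    k = len(means)
    cum_losses = [0.0] * k  # importance-weighted cumulative loss estimates
    pulls = [0] * k
    for t in range(1, horizon + 1):
        eta = 2.0 / math.sqrt(t)  # illustrative schedule, an assumption
        p = tsallis_probs(cum_losses, eta)
        arm = rng.choices(range(k), weights=p)[0]
        loss = 1.0 if rng.random() < means[arm] else 0.0
        # Only the pulled arm is observed; dividing by its sampling
        # probability keeps the loss estimate unbiased.
        cum_losses[arm] += loss / p[arm]
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    # Arms 0 and 1 are both optimal (mean loss 0.2), so the optimal
    # arm is not unique.
    print(run_bandit([0.2, 0.2, 0.6], horizon=2000))
```

Bisection is used here only for simplicity; any one-dimensional root finder for the normalization constant would serve the same purpose.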