The linear bandit problem has been studied for many years in both stochastic and adversarial settings. Designing an algorithm that performs well in an environment without knowing its loss type has attracted considerable interest. \citet{LeeLWZ021} propose an algorithm that actively detects the loss type and then switches between algorithms designed for specific settings. However, such an approach requires meticulous design to perform well in all environments. Follow-the-regularized-leader (FTRL) is another popular type of algorithm that can adapt to different environments. Compared with the detect-switch type, it has a simple design, and its regret bounds have been shown to be optimal in traditional multi-armed bandit problems. Designing an FTRL-type algorithm for linear bandits is an important question that has long been open. In this paper, we prove that the FTRL algorithm with a negative entropy regularizer can achieve best-of-three-worlds results for the linear bandit problem. Our regret bounds are of the same or nearly the same order as those of the previous detect-switch type algorithm, but with a much simpler algorithmic design.