We present improved algorithms with worst-case regret guarantees for the stochastic linear bandit problem. The widely used "optimism in the face of uncertainty" principle reduces a stochastic bandit problem to the construction of a confidence sequence for the unknown reward function. The performance of the resulting bandit algorithm depends on the size of the confidence sequence, with smaller confidence sets yielding better empirical performance and stronger regret guarantees. In this work, we use a novel tail bound for adaptive martingale mixtures to construct confidence sequences that are suitable for stochastic bandits. These confidence sequences allow for efficient action selection via convex programming. We prove that a linear bandit algorithm based on our confidence sequences is guaranteed to achieve competitive worst-case regret. We show, both empirically and theoretically, that our confidence sequences are tighter than those of competing methods. Finally, we demonstrate that our tighter confidence sequences give improved performance in several hyperparameter tuning tasks.
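For concreteness, the optimism principle described in the abstract can be written in its standard textbook form, sketched below. This is not the specific construction of this work, and the notation is assumed here for illustration only: $\theta^\star$ is the unknown reward parameter, $\mathcal{A}_t$ the action set at round $t$, $\Theta_t$ the confidence set after $t$ rounds, $\delta$ the failure probability, and $R_T$ the regret after $T$ rounds.

```latex
% Requires amsmath and amssymb. All symbols are illustrative assumptions.
\begin{align*}
  % Confidence sequence: every set contains the true parameter, uniformly in time
  &\mathbb{P}\bigl(\forall t \ge 1 :\ \theta^\star \in \Theta_t\bigr) \ge 1 - \delta, \\
  % Optimistic action selection over the current confidence set
  &a_t \in \operatorname*{arg\,max}_{a \in \mathcal{A}_t}\ \max_{\theta \in \Theta_{t-1}} \langle a, \theta \rangle, \\
  % Cumulative regret against the best action in each round
  &R_T = \sum_{t=1}^{T} \Bigl( \max_{a \in \mathcal{A}_t} \langle a, \theta^\star \rangle - \langle a_t, \theta^\star \rangle \Bigr).
\end{align*}
```

On the event that $\theta^\star \in \Theta_{t-1}$ for every $t$, each term of $R_T$ is at most $\max_{\theta \in \Theta_{t-1}} \langle a_t, \theta \rangle - \langle a_t, \theta^\star \rangle$, that is, the width of the confidence set in the direction of the chosen action. This is the sense in which smaller confidence sets translate into stronger regret guarantees.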