This paper studies the one-shot behavior of no-regret algorithms for stochastic bandits. Although many algorithms are known to be asymptotically optimal with respect to the expected regret, over a single run their pseudo-regret appears to follow one of two tendencies: it is either smooth or bumpy. To measure this tendency, we introduce a new notion, the sliding regret, which measures the worst pseudo-regret over a time window of fixed length as the window slides to infinity. We show that randomized methods (e.g., Thompson Sampling and MED) have optimal sliding regret, while index policies (e.g., UCB, UCB-V, KL-UCB, MOSS, and IMED), although possibly asymptotically optimal for the expected regret, have the worst possible sliding regret under regularity conditions on their index. We further analyze the average bumpiness of the pseudo-regret of index policies via the regret of exploration, which we show to be suboptimal as well.
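To make the notion concrete, the sketch below computes a finite-run proxy for the sliding regret: the largest pseudo-regret accumulated over any window of fixed length within one recorded run. It is an illustration only, not the paper's formal definition, which concerns the behavior of the window as it slides to infinity. The function names, the `burn_in` parameter, and the toy run are assumptions introduced here.

```python
from itertools import accumulate


def pseudo_regret(arms, means):
    """Cumulative pseudo-regret after each pull: running sum of the gaps
    (best mean minus the mean of the pulled arm)."""
    best = max(means)
    return list(accumulate(best - means[a] for a in arms))


def empirical_sliding_regret(arms, means, window, burn_in=0):
    """Finite-run proxy (an assumption of this sketch): the largest
    pseudo-regret accumulated over `window` consecutive pulls, over all
    windows that start at or after pull index `burn_in`."""
    if window <= 0 or len(arms) < burn_in + window:
        raise ValueError("run is too short for the requested window")
    # reg[t] is the pseudo-regret after the first t pulls, with reg[0] = 0.
    reg = [0.0] + pseudo_regret(arms, means)
    last_start = len(arms) - window
    return max(reg[t + window] - reg[t] for t in range(burn_in, last_start + 1))


# Hypothetical toy run: arm 0 is optimal, arm 1 has a gap of 0.5.
toy_means = [1.0, 0.5]
toy_arms = [0, 1, 1, 0, 0, 1, 0, 0]
early = empirical_sliding_regret(toy_arms, toy_means, window=2)
late = empirical_sliding_regret(toy_arms, toy_means, window=2, burn_in=3)
```

In the toy run, two consecutive suboptimal pulls early on make the worst window of length 2 cost twice the gap, while windows starting later contain at most one suboptimal pull. A larger `burn_in` on a long run is one way to approximate the asymptotic regime the abstract describes.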