Much of the literature on optimal design of bandit algorithms is based on minimization of expected regret. It is well known that designs that are optimal over certain exponential families can achieve expected regret that grows logarithmically in the number of arm plays, at a rate governed by the Lai-Robbins lower bound. In this paper, we show that when one uses such optimized designs, the regret distribution of the associated algorithms necessarily has a very heavy tail, specifically that of a truncated Cauchy distribution. Furthermore, for $p>1$, the $p$th moment of the regret distribution grows much faster than poly-logarithmically, in particular as a power of the total number of arm plays. We show that optimized UCB bandit designs are also fragile in an additional sense, namely that when the problem is even slightly mis-specified, the regret can grow much faster than the conventional theory suggests. Our arguments are based on standard change-of-measure ideas, and indicate that the most likely way for regret to become larger than expected is for the optimal arm to return below-average rewards in its first few plays, causing the algorithm to mistake that arm for a sub-optimal one. To alleviate these fragility issues, we show that UCB algorithms can be modified so as to ensure a desired degree of robustness to mis-specification. In doing so, we also exhibit a sharp trade-off between the amount of UCB exploration and the heaviness of the resulting regret distribution tail.
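For readers unfamiliar with the algorithm family under discussion, the following is a minimal sketch of a generic UCB index policy on Bernoulli arms; it is an illustrative baseline, not the paper's optimized design, and the function name `ucb_play` and the exploration scale `c` are our own choices. The "empirical mean plus exploration bonus" index below is the mechanism whose tuning the paper shows trades off against regret-tail heaviness.

```python
import math
import random

def ucb_play(means, horizon, c=2.0):
    """Run a basic UCB index policy on Bernoulli arms and return total
    (pseudo-)regret. `means` are the true arm means, unknown to the
    algorithm; `c` scales the exploration bonus. Generic sketch only."""
    k = len(means)
    counts = [0] * k      # number of plays of each arm
    sums = [0.0] * k      # cumulative reward of each arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # play each arm once to initialize
        else:
            # index = empirical mean + sqrt(c * log t / n_i) bonus
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(c * math.log(t) / counts[i]),
            )
        reward = 1.0 if random.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret
```

The fragility the paper analyzes can be seen directly in this sketch: if the best arm's first few `random.random()` draws happen to land low, its empirical mean stays depressed and the bonus term is the only force pulling the algorithm back, which can take a long time when `c` is tuned aggressively small.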