This work addresses a version of the two-armed Bernoulli bandit problem in which the means of the two arms sum to one (the symmetric two-armed Bernoulli bandit). We consider the regime where the gap between these means goes to zero as the number of prediction periods approaches infinity, i.e., the gap becomes harder to detect as the sample size increases. In this regime we obtain the leading-order terms of the minimax optimal regret and pseudoregret by associating each of them with a solution of a linear heat equation. Our results improve upon previously known results; specifically, we explicitly compute these leading-order terms in three different scaling regimes for the gap. Additionally, we obtain new non-asymptotic bounds for any given time horizon. Although optimal player strategies are not known for more general bandit problems, there is significant interest in how regret accumulates under specific player strategies, even when they are not known to be optimal. We expect the methods of this paper to be useful in settings of that type.
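To make the setting concrete, the following is a minimal simulation sketch of the symmetric two-armed Bernoulli bandit. It is not the method of this work: the function name `simulate_pseudoregret`, the signed-score "follow the leader" heuristic, and the choice of gap `1 / sqrt(horizon)` in the usage lines are all assumptions made purely for illustration. The sketch only shows how pseudoregret accumulates under one specific player strategy when the arm means are `(1 + gap) / 2` and `(1 - gap) / 2`.

```python
import math
import random


def simulate_pseudoregret(horizon, gap, seed=0):
    """Run one simulated game and return the accumulated pseudoregret.

    Hypothetical helper for illustration only. The player uses a simple
    follow-the-leader heuristic, which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    # Symmetric arms: the two means sum to one.
    means = [(1 + gap) / 2, (1 - gap) / 2]
    # Because the means sum to one, a success on one arm is evidence for
    # that arm and a failure is evidence for the other arm, so a single
    # signed score summarises the history.
    score = 0  # positive favours arm 0, negative favours arm 1
    pseudoregret = 0.0
    for _ in range(horizon):
        if score > 0:
            arm = 0
        elif score < 0:
            arm = 1
        else:
            arm = rng.randrange(2)  # break ties uniformly at random
        reward = 1 if rng.random() < means[arm] else 0
        # Success on arm 0 or failure on arm 1 counts in favour of arm 0.
        if (arm == 0) == (reward == 1):
            score += 1
        else:
            score -= 1
        # Pseudoregret adds the gap in means whenever the worse arm is pulled.
        pseudoregret += max(means) - means[arm]
    return pseudoregret


if __name__ == "__main__":
    horizon = 10_000
    gap = 1 / math.sqrt(horizon)  # illustrative small-gap choice (assumption)
    runs = [simulate_pseudoregret(horizon, gap, seed=s) for s in range(20)]
    print(sum(runs) / len(runs))
```

Averaging over several seeds, as in the usage lines, gives a rough empirical estimate of the expected pseudoregret of this heuristic for one horizon and one gap; it does not say anything about minimax optimality.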