We address the problem of best arm identification (BAI) with a fixed budget for two-armed Gaussian bandits. In BAI, given multiple arms, we aim to find the best arm, that is, the arm with the highest expected reward, through an adaptive experiment. Kaufmann et al. (2016) develop a lower bound on the probability of misidentifying the best arm. They also propose a strategy that assumes the reward variances are known, and show that it is asymptotically optimal in the sense that its probability of misidentification matches the lower bound as the budget approaches infinity. However, no asymptotically optimal strategy is known when the variances are unknown. To address this open issue, we propose a strategy that estimates the variances during the adaptive experiment and draws each arm with probability proportional to its estimated standard deviation. We refer to this strategy as the Neyman Allocation (NA)-Augmented Inverse Probability Weighting (AIPW) strategy. We then show that this strategy is asymptotically optimal: its probability of misidentification matches the lower bound as the budget approaches infinity and the gap between the expected rewards of the two arms approaches zero (the small-gap regime). Our results suggest that, in the worst case characterized by the small-gap regime, our strategy, which relies on estimated variances, is asymptotically optimal even when the variances are unknown.
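The allocation rule described above can be illustrated with a short simulation. The following is a minimal sketch under stated assumptions, not the authors' implementation: the abstract specifies only that arms are drawn in proportion to the estimated standard deviations and that the strategy is named after AIPW. The function name `na_aipw`, the probability clipping parameter `min_prob`, the default standard deviation of 1.0 before an arm has two observations, and the use of past sample means as the plug-in term in the AIPW estimator are all hypothetical choices made for illustration.

```python
import math
import random


def na_aipw(means, stds, budget, seed=0, min_prob=0.05):
    """Simulate one two-armed Gaussian experiment.

    Returns (recommended_arm, aipw_estimates, pull_counts).
    `means` and `stds` are the true parameters, used only to generate rewards.
    """
    rng = random.Random(seed)
    n = [0, 0]            # number of pulls per arm
    s = [0.0, 0.0]        # sum of rewards per arm
    ss = [0.0, 0.0]       # sum of squared rewards per arm
    aipw_sum = [0.0, 0.0]

    for _ in range(budget):
        # Plug-in estimates built from past rounds only.
        mu_hat = [s[a] / n[a] if n[a] > 0 else 0.0 for a in range(2)]
        sd_hat = []
        for a in range(2):
            if n[a] >= 2:
                var = max(ss[a] / n[a] - mu_hat[a] ** 2, 0.0)
                sd_hat.append(math.sqrt(var))
            else:
                sd_hat.append(1.0)  # assumed default before data is available

        # Neyman allocation: probability proportional to estimated std. dev.
        total = sd_hat[0] + sd_hat[1]
        p0 = sd_hat[0] / total if total > 0 else 0.5
        # Clipping (an assumption here) keeps the inverse weights bounded.
        p0 = min(max(p0, min_prob), 1.0 - min_prob)
        probs = [p0, 1.0 - p0]

        arm = 0 if rng.random() < p0 else 1
        reward = rng.gauss(means[arm], stds[arm])

        # AIPW score: inverse-probability-weighted residual plus plug-in mean.
        for a in range(2):
            correction = (reward - mu_hat[a]) / probs[a] if a == arm else 0.0
            aipw_sum[a] += correction + mu_hat[a]

        n[arm] += 1
        s[arm] += reward
        ss[arm] += reward ** 2

    estimates = [x / budget for x in aipw_sum]
    best = 0 if estimates[0] >= estimates[1] else 1
    return best, estimates, n
```

Calling, for example, `na_aipw([1.0, 0.0], [1.0, 3.0], budget=5000, seed=1)` runs one experiment in which the second arm is noisier, so the allocation should shift most draws toward it while the recommendation is based on the AIPW estimates of both means.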