We consider best arm identification in the multi-armed bandit problem. Under certain continuity conditions on the prior, we characterize the rate of the Bayesian simple regret. Unlike in Bayesian regret minimization (Lai, 1987), the leading term of the Bayesian simple regret comes from the region where the gap between the optimal and suboptimal arms is smaller than $\sqrt{\frac{\log T}{T}}$. We propose a simple, easy-to-compute algorithm whose leading term matches the lower bound up to a constant factor; simulation results support our theoretical findings.
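To make the quantity under study concrete, the following is a minimal Monte Carlo sketch of the Bayesian simple regret, that is, the prior-averaged gap between the best arm's mean and the recommended arm's mean after $T$ pulls. It is not the algorithm proposed in this work: it assumes Bernoulli arms, a uniform prior on each mean, and a plain uniform-allocation baseline that recommends the empirically best arm. The function name `bayesian_simple_regret` and all of its parameters are hypothetical and chosen only for illustration.

```python
import random


def bayesian_simple_regret(num_arms, horizon, num_trials, seed=0):
    """Monte Carlo estimate of Bayesian simple regret.

    Assumptions (illustrative, not from the source): Bernoulli arms,
    independent Uniform(0, 1) prior on each mean, and a uniform-allocation
    baseline that recommends the arm with the most observed successes.
    """
    rng = random.Random(seed)
    pulls_per_arm = horizon // num_arms
    total_regret = 0.0
    for _ in range(num_trials):
        # Draw one bandit instance from the prior.
        means = [rng.random() for _ in range(num_arms)]
        # Pull every arm the same number of times and count successes.
        successes = [
            sum(rng.random() < m for _ in range(pulls_per_arm)) for m in means
        ]
        # Recommend the empirically best arm (ties go to the lowest index).
        recommended = max(range(num_arms), key=lambda i: successes[i])
        # Simple regret on this instance: gap to the truly best arm.
        total_regret += max(means) - means[recommended]
    return total_regret / num_trials


if __name__ == "__main__":
    for horizon in (10, 100, 400):
        estimate = bayesian_simple_regret(2, horizon, 2000, seed=1)
        print(horizon, estimate)
```

Under these assumptions the estimate should shrink as the horizon grows, because a larger budget leaves only instances with very small gaps likely to be misidentified, which is the regime the abstract singles out as driving the leading term.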