We study the problem of identifying the best arm in a stochastic multi-armed bandit game. Given a set of $n$ arms indexed from $1$ to $n$, each arm $i$ is associated with an unknown reward distribution supported on $[0,1]$ with mean $\theta_i$ and variance $\sigma_i^2$. Assume $\theta_1 > \theta_2 \geq \cdots \geq \theta_n$, so that arm $1$ is the unique best arm. We propose an adaptive algorithm that explores the gaps and variances of the arms' rewards and bases its subsequent decisions on the gathered information, using a novel approach called \textit{grouped median elimination}. The proposed algorithm is guaranteed to output the best arm with probability at least $1-\delta$ and uses at most $O \left(\sum_{i = 1}^n \left(\frac{\sigma_i^2}{\Delta_i^2} + \frac{1}{\Delta_i}\right)(\ln \delta^{-1} + \ln \ln \Delta_i^{-1})\right)$ samples, where $\Delta_i = \theta_1 - \theta_i$ ($i \geq 2$) denotes the reward gap between arm $i$ and the best arm, and we define $\Delta_1 = \Delta_2$. This bound offers a significant advantage over variance-independent algorithms in favorable scenarios (for example, when the variances are small relative to the gaps) and is the first result to remove the extra $\ln n$ factor on the best arm that appears in the state of the art. We further show that $\Omega \left( \sum_{i = 1}^n \left( \frac{\sigma_i^2}{\Delta_i^2} + \frac{1}{\Delta_i} \right) \ln \delta^{-1} \right)$ samples are necessary for any algorithm to achieve the same goal, which shows that our algorithm is optimal up to doubly logarithmic terms.
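As an illustrative reading of the per-arm term in the bounds above (a worked example under stated assumptions, not a result taken from the analysis): because the rewards lie in $[0,1]$, we have $\sigma_i^2 \leq 1/4$ and $\Delta_i \leq 1$, hence $1/\Delta_i \leq 1/\Delta_i^2$ and
\[
\frac{\sigma_i^2}{\Delta_i^2} + \frac{1}{\Delta_i} \;\leq\; \frac{1}{4\Delta_i^2} + \frac{1}{\Delta_i^2} \;=\; \frac{5}{4\,\Delta_i^2},
\]
so each term is at most a constant factor above the $\Delta_i^{-2}$ per-arm scaling typical of variance-independent bounds. If, in addition, we assume a hypothetical low-variance regime with $\sigma_i^2 \leq \Delta_i$ for arm $i$, then
\[
\frac{\sigma_i^2}{\Delta_i^2} + \frac{1}{\Delta_i} \;\leq\; \frac{1}{\Delta_i} + \frac{1}{\Delta_i} \;=\; \frac{2}{\Delta_i},
\]
which is smaller than $\Delta_i^{-2}$ by a factor of order $\Delta_i^{-1}$. For instance, with the assumed values $\Delta_i = 0.01$ and $\sigma_i^2 = 0.01$, the term equals $100 + 100 = 200$, whereas $\Delta_i^{-2} = 10^4$.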