We consider the problem of best arm identification in the multi-armed bandit model, under fixed confidence. Given a confidence input $\delta$, the goal is to identify the arm with the highest mean reward with probability at least $1 - \delta$, while minimizing the number of arm pulls. While the literature provides solutions to this problem under the assumption of independent arm distributions, we propose a more flexible scenario where arms can be dependent and rewards can be sampled simultaneously. This framework allows the learner to estimate the covariance among the arm distributions, enabling a more efficient identification of the best arm. The relaxed setting we propose is relevant in various applications, such as clinical trials, where similarities between patients or drugs suggest underlying correlations in the outcomes. We introduce new algorithms that adapt to the unknown covariance of the arms and demonstrate through theoretical guarantees that substantial improvements can be achieved over the standard setting. Additionally, we provide new lower bounds for the relaxed setting and present numerical simulations that support our theoretical findings.
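To illustrate why simultaneous sampling of dependent arms can help, the following minimal sketch compares two estimates of the variance of the gap between two arms. It is not the algorithm proposed in this work; it only demonstrates the identity $\mathrm{Var}(X_a - X_b) = \mathrm{Var}(X_a) + \mathrm{Var}(X_b) - 2\,\mathrm{Cov}(X_a, X_b)$ on simulated data. The Gaussian rewards, the means 0.5 and 0.4, the correlation `rho`, and the function names `simulate_rounds` and `gap_variances` are all assumptions made for illustration.

```python
import random
import statistics


def simulate_rounds(rho, n_rounds=5000, seed=0):
    """Draw paired rewards for two unit-variance Gaussian arms with correlation rho.

    The means (0.5 and 0.4) are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    mean_a, mean_b = 0.5, 0.4
    rounds = []
    for _ in range(n_rounds):
        z0 = rng.gauss(0.0, 1.0)
        z1 = rng.gauss(0.0, 1.0)
        reward_a = mean_a + z0
        # Mixing z0 and z1 gives arm b unit variance and correlation rho with arm a.
        reward_b = mean_b + rho * z0 + (1.0 - rho**2) ** 0.5 * z1
        rounds.append((reward_a, reward_b))
    return rounds


def gap_variances(rounds):
    """Return (paired, independent) variance estimates for the gap between arms.

    paired:      sample variance of the per-round differences, which
                 implicitly subtracts twice the covariance.
    independent: sum of the two marginal sample variances, i.e. what the
                 learner would use if it ignored the dependence.
    """
    rewards_a = [a for a, _ in rounds]
    rewards_b = [b for _, b in rounds]
    differences = [a - b for a, b in rounds]
    paired = statistics.variance(differences)
    independent = statistics.variance(rewards_a) + statistics.variance(rewards_b)
    return paired, independent


paired, independent = gap_variances(simulate_rounds(rho=0.9))
print(f"paired gap variance:      {paired:.3f}")
print(f"independent gap variance: {independent:.3f}")
```

With positively correlated arms, the paired estimate is smaller than the independent one, so fewer simultaneous pulls should be needed to separate the two means at a given confidence level. With `rho=0.0` the two estimates roughly coincide, which corresponds to the standard independent setting.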