We study best arm identification in a variant of the multi-armed bandit problem where the learner has limited precision in arm selection. The learner can only sample arms via certain exploration bundles, which we refer to as boxes. In particular, at each sampling epoch, the learner selects a box, which in turn causes an arm to be pulled according to a box-specific probability distribution. The pulled arm and its instantaneous reward are revealed to the learner, whose goal is to identify the best arm while minimising the expected stopping time, subject to an upper bound on the error probability. We present an asymptotic lower bound on the expected stopping time, which holds as the error probability vanishes. We show that the optimal allocation suggested by the lower bound is, in general, non-unique and therefore challenging to track. We propose a modified tracking-based algorithm to handle non-unique optimal allocations, and demonstrate that it is asymptotically optimal. We also present non-asymptotic lower and upper bounds on the stopping time in the simpler setting where the arms accessible from different boxes do not overlap.
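To make the sampling model concrete, the following is a minimal simulation sketch of one sampling epoch, not the algorithm studied here. The function name `make_box_sampler`, the matrix `box_probs`, and the unit-variance Gaussian rewards are illustrative assumptions; the abstract does not specify the reward distributions.

```python
import numpy as np


def make_box_sampler(box_probs, arm_means, seed=0):
    """Return a function that simulates one sampling epoch.

    box_probs[b][a] is the (assumed known) probability that selecting
    box b pulls arm a; each row must sum to 1.
    arm_means[a] is the mean reward of arm a. Rewards are drawn as
    unit-variance Gaussians, which is an assumption made for illustration.
    """
    rng = np.random.default_rng(seed)
    box_probs = np.asarray(box_probs, dtype=float)
    arm_means = np.asarray(arm_means, dtype=float)
    if box_probs.shape[1] != arm_means.shape[0]:
        raise ValueError("box_probs needs one column per arm")
    if not np.allclose(box_probs.sum(axis=1), 1.0):
        raise ValueError("each row of box_probs must sum to 1")

    def sample(box):
        # The learner picks the box; the box picks the arm at random.
        arm = int(rng.choice(len(arm_means), p=box_probs[box]))
        # Both the pulled arm and its reward are revealed to the learner.
        reward = float(rng.normal(arm_means[arm], 1.0))
        return arm, reward

    return sample


# Two boxes over three arms with disjoint supports (the simpler setting).
sample = make_box_sampler(
    box_probs=[[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]],
    arm_means=[0.1, 0.5, 0.9],
)
arm, reward = sample(0)  # arm is 0 or 1; the learner cannot choose which
```

In this sketch the learner controls only the argument `box`; which arm is pulled is determined by the corresponding row of `box_probs`, which is the limited precision in arm selection that the problem formulation describes.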