In this paper, we consider a bandit problem in which there are a number of groups, each consisting of infinitely many arms. Whenever a new arm is requested from a given group, its mean reward is drawn from an unknown reservoir distribution (different for each group), and the uncertainty in the arm's mean reward can only be reduced via subsequent pulls of the arm. The goal is to identify the infinite-arm group whose reservoir distribution has the highest $(1-\alpha)$-quantile (e.g., median if $\alpha = \frac{1}{2}$), using as few total arm pulls as possible. We introduce a two-step algorithm that first requests a fixed number of arms from each group and then runs a finite-arm grouped max-quantile bandit algorithm. We characterize both the instance-dependent and worst-case regret, and provide a matching lower bound for the latter, while discussing various strengths, weaknesses, algorithmic improvements, and potential lower bounds associated with our instance-dependent upper bounds.
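To make the two-step structure concrete, the following is a minimal sketch under stated assumptions, not the algorithm analyzed in the paper. It assumes Bernoulli rewards, and it replaces the finite-arm grouped max-quantile bandit algorithm of the second step with a naive uniform-sampling stand-in that pulls every requested arm the same number of times. The names `two_step_identify`, `empirical_quantile`, `arms_per_group`, and `pulls_per_arm`, as well as the uniform reservoirs in the usage lines, are hypothetical and chosen only for illustration.

```python
import math
import random


def empirical_quantile(values, q):
    """Smallest value v such that at least a q fraction of `values` are <= v."""
    ordered = sorted(values)
    k = max(1, math.ceil(q * len(ordered)))
    return ordered[k - 1]


def two_step_identify(reservoirs, alpha, arms_per_group, pulls_per_arm, rng):
    """Return (index of the selected group, per-group quantile estimates).

    `reservoirs` is a list of callables mapping an RNG to a mean reward in
    [0, 1]; they are hypothetical stand-ins for the unknown reservoir
    distributions and are only ever sampled, never inspected.
    """
    estimates = []
    for draw_mean in reservoirs:
        # Step 1: request a fixed number of arms from this group.
        arm_means = [draw_mean(rng) for _ in range(arms_per_group)]

        # Step 2 (naive stand-in): pull each arm uniformly and estimate its
        # mean from Bernoulli rewards (an assumption of this sketch).
        arm_estimates = []
        for mu in arm_means:
            successes = sum(1 for _ in range(pulls_per_arm) if rng.random() < mu)
            arm_estimates.append(successes / pulls_per_arm)

        # Estimate the group's (1 - alpha)-quantile from its arm estimates.
        estimates.append(empirical_quantile(arm_estimates, 1 - alpha))

    best = max(range(len(estimates)), key=estimates.__getitem__)
    return best, estimates


# Usage: two groups whose reservoir medians are well separated.
rng = random.Random(0)
reservoirs = [
    lambda r: r.uniform(0.0, 0.4),
    lambda r: r.uniform(0.6, 1.0),
]
best, estimates = two_step_identify(
    reservoirs, alpha=0.5, arms_per_group=50, pulls_per_arm=200, rng=rng
)
print(best, estimates)
```

Uniform sampling spends the same budget on every arm, so it only illustrates the request-then-identify structure; an adaptive second step would concentrate pulls on the arms near each group's quantile.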