We study pure exploration with infinitely many bandit arms generated i.i.d. from an unknown distribution. Our goal is to efficiently select a single high-quality arm whose average reward is, with probability $1-\delta$, within $\varepsilon$ of being among the top $\eta$-fraction of arms; this is a natural adaptation of the classical PAC guarantee to infinite action sets. We consider both the fixed-confidence and fixed-budget settings, aiming respectively for minimal expected and fixed sample complexity. For fixed confidence, we give an algorithm with expected sample complexity $O\left(\frac{\log (1/\eta)\log (1/\delta)}{\eta\varepsilon^2}\right)$. This is optimal except for the $\log (1/\eta)$ factor, and the $\delta$-dependence closes a quadratic gap in the literature. For fixed budget, we show that the asymptotically optimal sample complexity as $\delta\to 0$ is $c^{-1}\log(1/\delta)\big(\log\log(1/\delta)\big)^2$ to leading order. Equivalently, the optimal failure probability given exactly $N$ samples decays as $\exp\big(-cN/\log^2 N\big)$, up to a factor $1\pm o_N(1)$ inside the exponent. The constant $c$ depends explicitly on the problem parameters (including the unknown arm distribution) through a certain Fisher information distance. Even the strictly super-linear dependence on $\log(1/\delta)$ was not previously known, and establishing it resolves a question of Grossman and Moshkovitz (FOCS 2016, SIAM Journal on Computing 2020).
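The equivalence between the two fixed-budget statements can be checked with a short heuristic calculation. This is only a sketch of the arithmetic, not the argument used to prove the result, and the shorthand $L=\log(1/\delta)$ is introduced here for convenience. Suppose $N = c^{-1}L(\log L)^2$ with $c$ fixed. Then, as $\delta\to 0$,

$$\log N = \log L + 2\log\log L - \log c = (1+o(1))\log L,$$

and therefore

$$\frac{cN}{\log^2 N} = \frac{L(\log L)^2}{(1+o(1))(\log L)^2} = (1+o(1))\,L.$$

Since $\delta = e^{-L}$, this gives $\delta = \exp\big(-(1\pm o(1))\,cN/\log^2 N\big)$, which matches the stated decay rate of the failure probability with the $1\pm o_N(1)$ factor inside the exponent.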