Given a set of arms $\mathcal{Z}\subset \mathbb{R}^d$ and an unknown parameter vector $\theta_\ast\in\mathbb{R}^d$, the pure exploration linear bandit problem aims to return $\arg\max_{z\in \mathcal{Z}} z^{\top}\theta_{\ast}$ with high probability, using noisy measurements of $x^{\top}\theta_{\ast}$ for $x\in \mathcal{X}\subset \mathbb{R}^d$. Existing (asymptotically) optimal methods require either a) potentially costly projections for each arm $z\in \mathcal{Z}$ or b) explicitly maintaining a subset of $\mathcal{Z}$ under consideration at each time. This complexity is at odds with the popular and simple Thompson Sampling algorithm for regret minimization, which requires only access to posterior sampling and argmax oracles and does not need to enumerate $\mathcal{Z}$ at any point. Unfortunately, Thompson Sampling is known to be sub-optimal for pure exploration. In this work, we pose a natural question: is there an algorithm that can explore optimally and needs only the same computational primitives as Thompson Sampling? We answer the question in the affirmative. We provide an algorithm that leverages only sampling and argmax oracles and achieves an exponential convergence rate, with an exponent that is asymptotically optimal among all possible allocations. In addition, we show that our algorithm can be easily implemented and performs as well empirically as existing asymptotically optimal methods.
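To make the two computational primitives concrete, the following is a minimal Python sketch of a posterior sampling oracle and an argmax oracle on a toy instance. It is not the algorithm proposed in this work. The Gaussian prior, unit noise variance, the toy sets `Z` and `X_pool`, and the function names `posterior_sample` and `argmax_oracle` are all illustrative assumptions. The argmax here enumerates a small finite $\mathcal{Z}$ only for simplicity; in structured problems the oracle would typically be a problem-specific solver that avoids enumeration.

```python
import numpy as np


def posterior_sample(X, y, rng, prior_var=1.0, noise_var=1.0):
    """Draw one theta from the Gaussian posterior of Bayesian linear regression.

    Assumes (for illustration) a N(0, prior_var * I) prior and Gaussian noise.
    """
    d = X.shape[1]
    precision = X.T @ X / noise_var + np.eye(d) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / noise_var
    return rng.multivariate_normal(mean, cov)


def argmax_oracle(Z, theta):
    """Return the index of the arm in Z maximizing z^T theta.

    Enumeration is used only because this toy Z is tiny.
    """
    return int(np.argmax(Z @ theta))


rng = np.random.default_rng(0)

# Hypothetical toy instance: d = 2, three arms, two measurement vectors.
theta_star = np.array([1.0, 0.0])
Z = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
X_pool = np.eye(2)

# Collect noisy measurements of x^T theta_star at uniformly chosen x.
idx = rng.integers(0, len(X_pool), size=200)
X = X_pool[idx]
y = X @ theta_star + rng.normal(size=len(idx))

# One Thompson-style step: sample a parameter, then query the argmax oracle.
theta_tilde = posterior_sample(X, y, rng)
best_arm = argmax_oracle(Z, theta_tilde)
print("sampled theta:", theta_tilde, "-> arm index:", best_arm)
```

Both functions touch $\mathcal{Z}$ only through the argmax call, which is the property the abstract highlights: swapping `argmax_oracle` for a combinatorial solver would leave the sampling step unchanged.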