We study the real-valued combinatorial pure exploration of the multi-armed bandit (R-CPE-MAB) problem. In R-CPE-MAB, a player is given $d$ stochastic arms, and the reward of each arm $s\in\{1, \ldots, d\}$ follows an unknown distribution with mean $\mu_s$. In each time step, the player pulls a single arm and observes its reward. The player's goal is to identify the optimal \emph{action} $\boldsymbol{\pi}^{*} = \argmax_{\boldsymbol{\pi} \in \mathcal{A}} \boldsymbol{\mu}^{\top}\boldsymbol{\pi}$ from a finite real-valued \emph{action set} $\mathcal{A}\subset \mathbb{R}^{d}$ with as few arm pulls as possible. Previous methods for R-CPE-MAB assume that the size of the action set $\mathcal{A}$ is polynomial in $d$. We introduce the Generalized Thompson Sampling Explore (GenTS-Explore) algorithm, which is the first algorithm that works even when the size of the action set is exponential in $d$. We also introduce a novel problem-dependent sample complexity lower bound for the R-CPE-MAB problem, and show that the GenTS-Explore algorithm achieves the optimal sample complexity up to a problem-dependent constant factor.
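To make the problem setting concrete, the following is a minimal sketch of R-CPE-MAB with a naive uniform-sampling baseline. It is not the GenTS-Explore algorithm. The function names, the Gaussian reward noise, the three-arm instance, and the explicit enumeration of $\mathcal{A}$ are all assumptions made for illustration; enumerating the action set is only feasible when it is small, which is exactly the restriction the abstract says GenTS-Explore removes.

```python
import random


def best_action(mu, actions):
    """Return the action pi in `actions` that maximizes the inner product mu^T pi."""
    return max(actions, key=lambda pi: sum(m * p for m, p in zip(mu, pi)))


def uniform_explore(true_mu, actions, pulls_per_arm, rng, noise_std=1.0):
    """Naive baseline (hypothetical, for illustration only): pull every arm
    the same number of times, then pick the best action under the empirical means."""
    estimates = []
    for mean in true_mu:
        # Each pull of an arm returns a noisy reward (Gaussian noise is an assumption).
        samples = [rng.gauss(mean, noise_std) for _ in range(pulls_per_arm)]
        estimates.append(sum(samples) / pulls_per_arm)
    return best_action(estimates, actions)


# Toy instance with d = 3 arms and a small real-valued action set (made-up values).
true_mu = [1.0, 0.2, 0.5]
actions = [(1.0, 0.0, 0.0), (0.0, 2.5, 0.0), (0.3, 0.3, 0.9)]

optimal = best_action(true_mu, actions)
guess = uniform_explore(true_mu, actions, pulls_per_arm=2000, rng=random.Random(0))
```

Here `optimal` is computed from the true means, which the player never sees, while `guess` uses only observed rewards. The sample complexity studied in the abstract is the total number of pulls needed for such a guess to match $\boldsymbol{\pi}^{*}$ with high probability.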