We study a distributed stochastic multi-armed bandit where a client supplies the learner with communication-constrained feedback based on the rewards for the corresponding arm pulls. In our setup, the client must encode the rewards such that the second moment of the encoded rewards is no more than $P$, and this encoded reward is further corrupted by additive Gaussian noise of variance $\sigma^2$; the learner only has access to this corrupted reward. For this setting, we derive an information-theoretic lower bound of $\Omega\left(\sqrt{\frac{KT}{\mathtt{SNR} \wedge 1}} \right)$ on the minimax regret of any scheme, where $\mathtt{SNR} := \frac{P}{\sigma^2}$, and $K$ and $T$ are the number of arms and the time horizon, respectively. Furthermore, we propose a multi-phase bandit algorithm, $\mathtt{UE\text{-}UCB++}$, which matches this lower bound up to a minor additive factor. $\mathtt{UE\text{-}UCB++}$ performs uniform exploration in its initial phases and then uses the {\em upper confidence bound} (UCB) bandit algorithm in its final phase. An interesting feature of $\mathtt{UE\text{-}UCB++}$ is that the coarse mean-reward estimates formed during one uniform exploration phase are used to refine the encoding protocol in the next phase, which in turn yields more accurate mean-reward estimates in that phase. This positive reinforcement cycle is critical to reducing the number of uniform exploration rounds and closely matching our lower bound.
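To make the power constraint and the refinement cycle concrete, the following is a minimal sketch of a center-and-scale encoder over the Gaussian channel. It is not the encoding used by $\mathtt{UE\text{-}UCB++}$, which the abstract does not specify. It assumes unit-variance Gaussian rewards whose mean lies within `width` of a known `center`, and all function and parameter names are hypothetical.

```python
import math
import random


def encode(reward, center, width, power):
    """Center and scale a reward so the transmitted second moment is at most `power`.

    Assumption: rewards have unit variance and a mean within `width` of `center`,
    so E[(reward - center)^2] <= 1 + width^2.
    """
    scale = math.sqrt(power / (1.0 + width ** 2))
    return scale * (reward - center), scale


def channel(symbol, noise_var, rng):
    """Additive Gaussian noise channel with variance `noise_var`."""
    return symbol + rng.gauss(0.0, math.sqrt(noise_var))


def decode(received, center, scale):
    """Undo the scaling and centering applied by `encode`."""
    return center + received / scale


def decoded_variance(power, noise_var, width):
    """Variance of one decoded reward: reward variance (1) plus rescaled channel noise."""
    snr = power / noise_var
    return 1.0 + (1.0 + width ** 2) / snr


def simulate_mean_estimate(mu, center, width, power, noise_var, n, seed=0):
    """Average n decoded rewards; also return the empirical transmit power."""
    rng = random.Random(seed)
    total = 0.0
    power_used = 0.0
    for _ in range(n):
        reward = rng.gauss(mu, 1.0)  # assumed unit-variance Gaussian reward
        symbol, scale = encode(reward, center, width, power)
        power_used += symbol ** 2
        total += decode(channel(symbol, noise_var, rng), center, scale)
    return total / n, power_used / n
```

Under these assumptions with $\mathtt{SNR} = 1$, `decoded_variance(1.0, 1.0, 10.0)` evaluates to 102, while `decoded_variance(1.0, 1.0, 0.5)` evaluates to 2.25. A tighter interval around the mean, as a coarse earlier estimate would supply, lets the encoder spend less of the power budget on the unknown offset, so the learner sees less rescaled channel noise per sample.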